diff --git a/README.md b/README.md index 922884a..9e496b5 100644 --- a/README.md +++ b/README.md @@ -1 +1,26 @@ -# skill-temporal-dev +# Temporal Development Skill + +A comprehensive skill for building Temporal applications. + +## Installation + +### As a Claude Code Plugin + +1. Run `/plugin marketplace add temporalio/agent-skills` +2. Run `/plugin` to open the plugin manager +3. Select **Marketplaces** +4. Choose `temporal-marketplace` from the list +5. Select **Enable auto-update** or **Disable auto-update** +6. Run `/plugin install temporal-developer@temporalio-agent-skills` +7. Restart Claude Code + +### Via `npx skills` - supports all major coding agents + +1. `npx skills add temporalio/skill-temporal-developer` +2. Follow the prompts + +### Via manually cloning the skill repo + +1. `mkdir -p ~/.claude/skills && git clone https://github.com/temporalio/skill-temporal-developer ~/.claude/skills/temporal-developer` + +Adjust the installation directory as appropriate for your coding agent. \ No newline at end of file diff --git a/SKILL.md b/SKILL.md new file mode 100644 index 0000000..6c36f07 --- /dev/null +++ b/SKILL.md @@ -0,0 +1,131 @@ +--- +name: temporal-developer +description: This skill should be used when the user asks to "create a Temporal workflow", "write a Temporal activity", "debug stuck workflow", "fix non-determinism error", "Temporal Python", "Temporal TypeScript", "workflow replay", "activity timeout", "signal workflow", "query workflow", "worker not starting", "activity keeps retrying", "Temporal heartbeat", "continue-as-new", "child workflow", "saga pattern", "workflow versioning", "durable execution", "reliable distributed systems", or mentions Temporal SDK development. +version: 1.0.0 +--- + +# Skill: temporal-developer + +## Overview + +Temporal is a durable execution platform that makes workflows survive failures automatically. This skill provides guidance for building Temporal applications in Python and TypeScript. 
+ +## Core Architecture + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Temporal Cluster │ +│ ┌─────────────────┐ ┌─────────────────┐ ┌────────────────┐ │ +│ │ Event History │ │ Task Queues │ │ Visibility │ │ +│ │ (Durable Log) │ │ (Work Router) │ │ (Search) │ │ +│ └─────────────────┘ └─────────────────┘ └────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ + ▲ + │ Poll / Complete + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Worker │ +│ ┌─────────────────────────┐ ┌──────────────────────────────┐ │ +│ │ Workflow Definitions │ │ Activity Implementations │ │ +│ │ (Deterministic) │ │ (Non-deterministic OK) │ │ +│ └─────────────────────────┘ └──────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +**Components:** +- **Workflows** - Durable, deterministic functions that orchestrate activities +- **Activities** - Non-deterministic operations (API calls, I/O) that can fail and retry +- **Workers** - Long-running processes that poll task queues and execute code +- **Task Queues** - Named queues connecting clients to workers + +## History Replay: Why Determinism Matters + +Temporal achieves durability through **history replay**: + +1. **Initial Execution** - Worker runs workflow, generates Commands, stored as Events in history +2. **Recovery** - On restart/failure, Worker re-executes workflow from beginning +3. **Matching** - SDK compares generated Commands against stored Events +4. 
**Restoration** - Uses stored Activity results instead of re-executing + +**If Commands don't match Events = Non-determinism Error = Workflow blocked** + +| Workflow Code | Command | Event | +|--------------|---------|-------| +| Execute activity | `ScheduleActivityTask` | `ActivityTaskScheduled` | +| Sleep/timer | `StartTimer` | `TimerStarted` | +| Child workflow | `StartChildWorkflowExecution` | `ChildWorkflowExecutionStarted` | + +See `references/core/determinism.md` for a detailed explanation. + +## Getting Started + +### Ensure the Temporal CLI is installed + +Check if the `temporal` CLI is installed. If not, follow these instructions: + +#### macOS + +``` +brew install temporal +``` + +#### Linux + +Check your machine's architecture and download the appropriate archive: + +- [Linux amd64](https://temporal.download/cli/archive/latest?platform=linux&arch=amd64) +- [Linux arm64](https://temporal.download/cli/archive/latest?platform=linux&arch=arm64) + +Once you've downloaded the archive, extract it and add the `temporal` binary to your PATH by copying it to a directory like `/usr/local/bin`. + +#### Windows + +Check your machine's architecture and download the appropriate archive: + +- [Windows amd64](https://temporal.download/cli/archive/latest?platform=windows&arch=amd64) +- [Windows arm64](https://temporal.download/cli/archive/latest?platform=windows&arch=arm64) +

Once you've downloaded the archive, extract it and add the `temporal.exe` binary to your PATH. + +### Read All Relevant References + +1. First, read the getting started guide for the language you are working in: + - Python -> read `references/python/python.md` + - TypeScript -> read `references/typescript/typescript.md` +2. Second, read the appropriate `core` and language-specific references for the task at hand. 
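Whichever platform you installed the CLI on, a quick sanity check confirms it is available before moving on:

```shell
# Confirm the Temporal CLI is on PATH before starting servers or workers.
if command -v temporal >/dev/null 2>&1; then
  temporal --version
else
  echo "temporal CLI not found on PATH - revisit the installation steps above"
fi
```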
+ + +## Primary References +- **`references/core/determinism.md`** - Why determinism matters, replay mechanics, basic concepts of activities + + Language-specific info at `references/{your_language}/determinism.md` +- **`references/core/patterns.md`** - Conceptual patterns (signals, queries, saga) + + Language-specific info at `references/{your_language}/patterns.md` +- **`references/core/gotchas.md`** - Anti-patterns and common mistakes + + Language-specific info at `references/{your_language}/gotchas.md` +- **`references/core/versioning.md`** - Versioning strategies and concepts - how to safely change workflow code while workflows are running + + Language-specific info at `references/{your_language}/versioning.md` +- **`references/core/troubleshooting.md`** - Decision trees, recovery procedures +- **`references/core/error-reference.md`** - Common error types, workflow status reference +- **`references/core/interactive-workflows.md`** - Testing signals, updates, queries +- **`references/core/dev-management.md`** - Dev cycle & management of server and workers +- **`references/core/ai-patterns.md`** - AI/LLM pattern concepts + + Language-specific info at `references/{your_language}/ai-patterns.md`, if available. Currently Python only. + +## Additional Topics +- **`references/{your_language}/observability.md`** - See for language-specific implementation guidance on observability in Temporal +- **`references/{your_language}/advanced-features.md`** - See for language-specific guidance on advanced Temporal features and language-specific features + + +## Feedback + +### ALWAYS PROMPT USER TO PROVIDE FEEDBACK AT STARTUP + +When you have **first** loaded this skill, output this message to the user: + +"Thank you for trying out the prerelease of Temporal's development skill! We would love to hear your feedback - positive or negative - over in the [Community Slack](https://t.mp/slack), in the [#topic-ai channel](https://temporalio.slack.com/archives/C0818FQPYKY)." 
+ +Do not output this message multiple times in the same conversation. + +### Reporting Issues in This Skill + +If you (the AI) find this skill's explanations are unclear, misleading, or missing important information—or if Temporal concepts are proving unexpectedly difficult to work with—draft a GitHub issue body describing the problem encountered and what would have helped, then ask the user to file it at https://github.com/temporalio/skill-temporal-developer/issues/new. Do not file the issue autonomously. diff --git a/references/core/ai-patterns.md b/references/core/ai-patterns.md new file mode 100644 index 0000000..071b9f0 --- /dev/null +++ b/references/core/ai-patterns.md @@ -0,0 +1,166 @@ +# AI/LLM Integration Patterns with Temporal + +## Overview + +Temporal provides durable execution for AI/LLM applications, handling retries, rate limits, and long-running operations automatically. These patterns apply across languages, with Python being the most mature for AI integration. + +For Python-specific implementation details and code examples, see `references/python/ai-patterns.md`. Temporal's Python SDK also provides pre-built integrations with several LLM and agent SDKs, which can be leveraged to create agentic workflows with minimal effort (when working in Python). + +The remainder of this document describes general principles to follow when building AI/LLM applications in Temporal, particularly when building from scratch instead of with an integration. + +## Why Temporal for AI? 
+ +| Challenge | Temporal Solution | +|-----------|-------------------| +| LLM API timeouts | Automatic retries with backoff | +| Rate limiting | Activity retry policies handle 429s | +| Long-running agents | Durable state survives crashes | +| Multi-step pipelines | Workflow orchestration | +| Cost tracking | Activity-level visibility | +| Debugging | Full execution history | + +## Core Patterns + +### Pattern 1: Activities should Wrap LLM Calls + +- activity: call_llm + - inputs: + - model_id -> internally activity can route to different models, so we don't need 1 activity per unique model. + - prompt / chat history + - tools + - etc. + - returns model response, as a typed structured output + +**Benefits**: +- Single activity handles multiple use cases +- Consistent retry handling +- Centralized configuration + +### Pattern 2: Non-deterministic / heavy tools in Activities + +Tools which are non-deterministic and/or heavy actions (file system, hitting APIs, etc.) should be placed in activities: + +``` +Workflow: + ├── Activity: call_llm (get tool selection) + ├── Activity: execute_tool (run selected tool) + └── Activity: call_llm (interpret results) +``` + +**Benefits**: +- Independent retry for each step +- Clear audit trail in history +- Easier testing and mocking +- Failure isolation + +### Pattern 3: Tools that Mutate Agent State can be in the Workflow directly + +Generally, agent state is in bijection with workflow state. Thus, tools which mutate agent state and are deterministic (like TODO tools, just updating a hash map) typically belong in the workflow code rather than an activity. + +``` +Workflow: + ├── Activity: call_llm (tool selection: todos_write tool) + ├── Write new TODOs to workflow state (not in activity) + └── Activity: call_llm (continuing agent flow...) +``` + +### Pattern 4: Centralized Retry Management + +Disable retries in LLM client libraries, let Temporal handle retries. 
+ +- LLM Client Config: + - max_retries = 0 ← Disable client retries at the LLM client level + +Use either the default activity retry policy, or customize it as needed for the situation. + +**Why**: +- Temporal retries are durable (survive crashes) +- Single retry configuration point +- Better visibility into retry attempts +- Consistent backoff behavior + + +### Pattern 5: Multi-Agent Orchestration + +Complex pipelines with multiple specialized agents: + +``` +Deep Research Example: + │ + ├── Planning Agent (Activity) + │ └── Output: subtopics to research + │ + ├── Query Generation Agent (Activity) + │ └── Output: search queries per subtopic + │ + ├── Parallel Web Search (Multiple Activities) + │ └── Output: search results (resilient to partial failures) + │ + └── Synthesis Agent (Activity) + └── Output: final report +``` + +**Key Pattern**: Use parallel execution with `return_exceptions=True` to continue with partial results when some searches fail. + +## Approximate Timeout Recommendations + +| Operation Type | Recommended Timeout | +|----------------|---------------------| +| Simple LLM calls (GPT-4, Claude-3) | 30 seconds | +| Reasoning models (o1, o3, extended thinking) | 300 seconds (5 min) | +| Web searches | 300 seconds (5 min) | +| Simple tool execution | 30-60 seconds | +| Image generation | 120 seconds | +| Document processing | 60-120 seconds | + +**Rationale**: +- Reasoning models need time for complex computation +- Web searches may hit rate limits requiring backoff +- Fast timeouts catch stuck operations +- Longer timeouts prevent premature failures for expensive operations + +## Rate Limit Handling + +### From HTTP Headers + +Parse rate limit info from API responses: + +- Response Headers: + - Retry-After: 30 + - X-RateLimit-Remaining: 0 + +- Activity: + - If rate limited: + - Raise retryable error with a next retry delay + - Temporal handles the delay + +## Error Handling + +### Retryable Errors +- Rate limits (429) +- Timeouts +- Temporary server 
errors (500, 502, 503) +- Network errors + +### Non-Retryable Errors +- Invalid API key (401) +- Invalid input/prompt +- Content policy violations +- Model not found + +## Best Practices + +1. **Disable client retries** - Let Temporal handle all retries +2. **Set appropriate timeouts** - Based on operation type +3. **Separate activities** - One per logical operation +4. **Use structured outputs** - For type safety and validation +5. **Handle partial failures** - Continue with available results +6. **Monitor costs** - Track LLM calls at activity level +7. **Test with mocks** - Mock LLM responses in tests + +## Observability + +See `references/{your_language}/observability.md` for language-specific documentation on implementing observability in Temporal. It is generally recommended to add observability for: +- Token usage, via activity logging +- Anything else that helps track LLM usage and debug agentic flows, in moderation. + diff --git a/references/core/determinism.md b/references/core/determinism.md new file mode 100644 index 0000000..bf4f1ec --- /dev/null +++ b/references/core/determinism.md @@ -0,0 +1,116 @@ +# Determinism in Temporal Workflows + +This document provides a conceptual-level overview of determinism in Temporal. Additional language-specific determinism information is available at `references/{your_language}/determinism.md`. + +## Overview + +Temporal workflows must be deterministic because of **history replay** - the mechanism that enables durable execution. + +## Why Determinism Matters + +### The Replay Mechanism + +When a Worker needs to restore workflow state (after a crash, cache eviction, or continuing after a long timer), it **re-executes the workflow code from the beginning**. But instead of re-running external actions, it uses results stored in the Event History. 
+ +``` +Initial Execution: + Code runs → Generates Commands → Server stores as Events + +Replay (Recovery): + Code runs again → Generates Commands → SDK compares to Events + If match: Use stored results, continue + If mismatch: NondeterminismError! +``` + +### Commands and Events + +Every workflow operation generates a Command that becomes an Event, here are some examples: + +| Workflow Code | Command Generated | Event Stored | +|--------------|-------------------|--------------| +| Execute activity | `ScheduleActivityTask` | `ActivityTaskScheduled` | +| Sleep/timer | `StartTimer` | `TimerStarted` | +| Child workflow | `StartChildWorkflowExecution` | `ChildWorkflowExecutionStarted` | +| Complete workflow | `CompleteWorkflowExecution` | `WorkflowExecutionCompleted` | + +### Non-Determinism Example + +``` +First Run (11:59 AM): + if datetime.now().hour < 12: → True + execute_activity(morning_task) → Command: ScheduleActivityTask("morning_task") + +Replay (12:01 PM): + if datetime.now().hour < 12: → False + execute_activity(afternoon_task) → Command: ScheduleActivityTask("afternoon_task") + +Result: Commands don't match history → NondeterminismError +``` + +## Sources of Non-Determinism + +### Time-Based Operations +- `datetime.now()`, `time.time()`, `Date.now()` +- Different value on each execution + +### Random Values +- `random.random()`, `Math.random()`, `uuid.uuid4()` +- Different value on each execution + +### External State +- Reading files, environment variables, databases, networking / HTTP calls +- State may change between executions + +### Non-Deterministic Iteration +- Map/dict iteration order (in some languages) +- Set iteration order + +### Threading/Concurrency +- Race conditions produce different outcomes +- Non-deterministic ordering + +## **Central Concept**: Place Non-Determinism within Activities + +In Temporal, activities are the primary mechanism for making non-deterministic code durable and persisted in workflow history. 
Generally speaking, you should place sources of non-determinism in activities, which provides durability and recording of results, as well as automatic retries and more. See `references/{your_language}/{your_language}.md` for how to do this in practice in the language you are working in. + +For a few simple cases, like timestamps, random values, and UUIDs, the Temporal SDK in your language may provide durable variants that are simple to use. See `references/{your_language}/determinism.md` for more info. + +## SDK Protection Mechanisms +Each Temporal SDK language provides a protection mechanism to make it easier to catch non-determinism errors earlier in development: + +- Python: The Python SDK runs workflows in a sandbox that intercepts and aborts non-deterministic calls at runtime. +- TypeScript: The TypeScript SDK runs workflows in an isolated V8 sandbox, intercepting many common sources of non-determinism and replacing them automatically with deterministic variants. + + +## Detecting Non-Determinism + +### During Execution +- `NondeterminismError` raised when Commands don't match Events +- Workflow becomes blocked until code is fixed + +### Testing with Replay + +Replay tests verify that workflows follow identical code paths when re-run, by attempting to replay recorded executions. See the replay testing section of `references/{your_language}/testing.md` for information on how to write these tests. + +## Recovery from Non-Determinism + +### Accidental Change +If you accidentally introduced non-determinism: +1. Revert code to match what's in history +2. Restart worker +3. Workflow auto-recovers + +### Intentional Change +If you need to change workflow logic: +1. Use the **Patching API** to support both old and new code paths +2. Or terminate old workflows and start new ones with updated code + +See `versioning.md` for patching details. + +## Best Practices + +1. **Use SDK-provided alternatives** for time, random, UUID +2. 
**Move I/O to activities** - workflows should only orchestrate +3. **Test with replay** before deploying workflow changes +4. **Use patching** for intentional changes to running workflows +5. **Keep workflows focused** - complex logic increases non-determinism risk diff --git a/references/core/dev-management.md b/references/core/dev-management.md new file mode 100644 index 0000000..01faed0 --- /dev/null +++ b/references/core/dev-management.md @@ -0,0 +1,26 @@ +# Development Server and Worker Management + +## Server Management + +Before starting workers or workflows, you MUST start a local dev server using the Temporal CLI: + +```bash +temporal server start-dev # Start this in the background. +``` + +It is perfectly OK to share this process across multiple projects and leave it running as you develop your Temporal code. + +## Worker Management Details + +### Starting Workers + +How you start a worker is project-dependent, but generally Temporal code should have a program entrypoint which starts a worker. If your project doesn't have one, define it. + +When you need a new worker, start it in the background (preferably logging somewhere you can check), and remember its PID so you can kill and clean it up later. + +**Best practice**: For local development, run only ONE worker instance, with the latest code. Don't keep stale workers (running old code) around. + + +### Cleanup + +**Always kill workers when done.** Don't leave workers running. 
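A sketch of this worker lifecycle, assuming a Python project whose worker entrypoint is `worker.py` (a hypothetical name; adjust the command to your project):

```shell
# Start the worker in the background, keeping its logs inspectable.
python worker.py > worker.log 2>&1 &
WORKER_PID=$!                            # remember the PID for later cleanup
echo "worker running as PID $WORKER_PID (logs: worker.log)"

# ... edit code, run workflows, tail worker.log ...

# Kill the stale worker before starting one with new code, and when done.
kill "$WORKER_PID" 2>/dev/null || true
```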
diff --git a/references/core/error-reference.md b/references/core/error-reference.md new file mode 100644 index 0000000..a0f905b --- /dev/null +++ b/references/core/error-reference.md @@ -0,0 +1,32 @@ +# Common Error Types Reference + +| Error Type | Error identifier (if any) | Where to Find | What Happened | Recovery | Link to additional info (if any) +|------------|---------------|---------------|---------------|----------|----------| +| **Non-determinism** | TMPRL1100 | `WorkflowTaskFailed` in history | Replay doesn't match history | Analyze error first. **If accidental**: fix code to match history → restart worker. **If intentional v2 change**: terminate → start fresh workflow. | https://github.com/temporalio/rules/blob/main/rules/TMPRL1100.md | +| **Deadlock** | TMPRL1101 | `WorkflowTaskFailed` in history, worker logs | Workflow blocked too long (deadlock detected) | Remove blocking operations from workflow code (no I/O, no sleep, no threading locks). Use Temporal primitives instead. | https://github.com/temporalio/rules/blob/main/rules/TMPRL1101.md | +| **Unfinished handlers** | TMPRL1102 | `WorkflowTaskFailed` in history | Workflow completed while update/signal handlers still running | Ensure all handlers complete before workflow finishes. Use `workflow.wait_condition()` to wait for handler completion. | https://github.com/temporalio/rules/blob/main/rules/TMPRL1102.md | +| **Payload overflow** | TMPRL1103 | `WorkflowTaskFailed` or `ActivityTaskFailed` in history | Payload size limit exceeded (default 2MB) | Reduce payload size. Use external storage (S3, database) for large data and pass references instead. 
| https://github.com/temporalio/rules/blob/main/rules/TMPRL1103.md | +| **Workflow code bug** | | `WorkflowTaskFailed` in history | Bug in workflow logic | Fix code → Restart worker → Workflow auto-resumes | | +| **Missing workflow** | | Worker logs | Workflow not registered | Add to worker.py → Restart worker | | +| **Missing activity** | | Worker logs | Activity not registered | Add to worker.py → Restart worker | | +| **Activity bug** | | `ActivityTaskFailed` in history | Bug in activity code | Fix code → Restart worker → Auto-retries | | +| **Activity retries** | | `ActivityTaskFailed` (count >2) | Repeated failures | Fix code → Restart worker → Auto-retries | | +| **Sandbox violation** | | Worker logs | Bad imports in workflow | Fix workflow.py imports → Restart worker | | +| **Task queue mismatch** | | Workflow never starts | Different queues in starter/worker | Align task queue names | | +| **Timeout** | | Status = TIMED_OUT | Operation too slow | Increase timeout config | | + +## Workflow Status Reference + +| Status | Meaning | Action | +|--------|---------|--------| +| `RUNNING` | Workflow in progress | Wait, or check if stalled | +| `COMPLETED` | Successfully finished | Get result, verify correctness | +| `FAILED` | Error during execution | Analyze error | +| `CANCELED` | Explicitly canceled | Review reason | +| `TERMINATED` | Force-stopped | Review reason | +| `TIMED_OUT` | Exceeded timeout | Increase timeout | + +## See Also + +- [Common Gotchas](gotchas.md) - Anti-patterns that cause these errors +- [Troubleshooting](troubleshooting.md) - Decision trees for diagnosing issues diff --git a/references/core/gotchas.md b/references/core/gotchas.md new file mode 100644 index 0000000..55b6ddb --- /dev/null +++ b/references/core/gotchas.md @@ -0,0 +1,196 @@ +# Common Temporal Gotchas + +Common mistakes and anti-patterns in Temporal development. Learning from these saves significant debugging time. 
+ +This document provides a general overview of conceptual-level gotchas in Temporal. The exact form that these take and symptoms can vary by SDK language. See `references/{your_language}/gotchas.md` for language-specific info on common mistakes. + +## Non-Idempotent Activities + +**The Problem**: Activities may execute more than once due to retries or Worker failures. If an activity calls an external service without an idempotency key, you may charge a customer twice, send duplicate emails, or create duplicate records. + +**Symptoms**: +- Duplicate side effects (double charges, duplicate notifications) +- Data inconsistencies after retries + +**The Fix**: Always use idempotency keys when calling external services. Use the workflow ID, activity ID, or a domain-specific identifier (like order ID) as the key. + +**Note:** Local Activities skip the task queue for lower latency, but they're still subject to retries. The same idempotency rules apply. + +## Side Effects & Non-Determinism in Workflow Code + +**The Problem**: Code in workflow functions runs on first execution AND on every replay. Any side effect (logging, notifications, metrics, etc.) will happen multiple times and non-deterministic code (IO, current time, random numbers, threading, etc.) won't replay correctly. + +**Symptoms**: +- Non-determinism errors +- Sandbox violations, depending on SDK language +- Duplicate log entries +- Multiple notifications for the same event +- Inflated metrics + +**The Fix**: +- Use Temporal replay-aware managed side effects for common, non-business logic cases: + - Temporal workflow logging + - Temporal date time + - Temporal UUID generation + - Temporal random number generation +- Put all other side effects in Activities + +See `references/core/determinism.md` for more info. + +## Multiple Workers with Different Code + +**The Problem**: If Worker A runs part of a workflow with code v1, then Worker B (with code v2) picks it up, replay may produce different Commands. 
**Symptoms**: +- Non-determinism errors after deploying new code +- Errors mentioning "command mismatch" or "unexpected command" + +**The Fix**: +- Use Worker Versioning for production deployments +- Use patching APIs +- During development: kill old workers before starting new ones +- Ensure all workers run identical code + +**Note:** Workflows started with old code continue running after you change the code, which can then induce the above issues. During development (NOT production), you may want to terminate stale workflows (`temporal workflow terminate --workflow-id <workflow-id>`). + +See `references/core/versioning.md` for more info. + +## Failing Activities Too Quickly + +**The Problem**: Using aggressive activity retry policies that give up too easily. + +**Symptoms**: +- Workflows failing on transient errors +- Unnecessary workflow failures during brief outages + +**The Fix**: Use appropriate activity retry policies. Let Temporal handle transient failures with exponential backoff. Reserve `maximum_attempts=1` for truly non-retryable operations. + +## Query Handler & Update Validator Mistakes + +### Modifying State in Queries & Update Validators + +**The Problem**: Queries and update validators are read-only. Modifying state causes non-determinism on replay, and must be strictly avoided. + +**Symptoms**: +- State inconsistencies after workflow replay +- Non-determinism errors + +**The Fix**: Queries and update validators must only read state. Use Updates for operations that need to modify state AND return a result. + +### Blocking in Queries & Update Validators + +**The Problem**: Queries and update validators must return immediately. They cannot await activities, child workflows, timers, or conditions. + +**Symptoms**: +- Query / update validator timeouts +- Deadlocks + +**The Fix**: Queries and update validators must only look at current state. Use Signals or Updates to trigger async operations. + +### Query vs Signal vs Update + +| Operation | Modifies State? 
| Returns Result? | Can Block? | Use For | +|-----------|-----------------|-----------------|------------|---------| +| **Query** | No | Yes | No | Read current state | +| **Signal** | Yes | No | Yes | Fire-and-forget mutations | +| **Update** | Yes | Yes | Yes | Mutations needing results | + +**Key rule**: Query to peek, Signal to push, Update to pop. + +## File Organization Issues + +Each SDK has specific requirements for how workflow and activity code should be organized. Mixing them incorrectly causes sandbox issues, bundling problems, or performance degradation. + +See language-specific gotchas for details. + +## Testing Mistakes + +### Only Testing Happy Paths + +**The Problem**: Not testing what happens when things go wrong. + +**Questions to answer**: +- What happens when an Activity exhausts all retries? +- What happens when a workflow is cancelled mid-execution? +- What happens during a Worker restart? + +**The Fix**: Test failure scenarios explicitly. Mock activities to fail, test cancellation handling, use replay testing. + +### Not Testing Replay Compatibility + +**The Problem**: Changing workflow code without verifying existing workflows can still replay. + +**Symptoms**: +- Non-determinism errors after deployment +- Stuck workflows that can't make progress + +**The Fix**: Use replay testing against saved histories from production or staging. + +## Error Handling Mistakes + +### Swallowing Errors + +**The Problem**: Catching errors without proper handling hides failures. + +**Symptoms**: +- Silent failures +- Workflows completing "successfully" despite errors +- Difficult debugging + +**The Fix**: Log errors and make deliberate decisions. Either re-raise, use a fallback, or explicitly document why ignoring is safe. + +### Wrong Retry Classification + +**The Problem**: Marking transient errors as non-retryable, or permanent errors as retryable. 
+ +**Symptoms**: +- Workflows failing on temporary network issues (if marked non-retryable) +- Infinite retries on invalid input (if marked retryable) + +**The Fix**: +- **Retryable**: Network errors, timeouts, rate limits, temporary unavailability +- **Non-retryable**: Invalid input, authentication failures, business rule violations, resource not found + +## Cancellation Handling + +### Not Handling Workflow Cancellation + +**The Problem**: When a workflow is cancelled, cleanup code after the cancellation point doesn't run unless explicitly protected. + +**Symptoms**: +- Resources not released after cancellation +- Incomplete compensation/rollback +- Leaked state + +**The Fix**: Use language-specific cancellation scopes or try/finally blocks to ensure cleanup runs even on cancellation. See language-specific gotchas for implementation details. + +### Not Handling Activity Cancellation + +**The Problem**: Activities must opt in to receive cancellation. Without proper handling, a cancelled activity continues running to completion, wasting resources. + +**Requirements for activity cancellation**: +1. **Heartbeating** - Cancellation is delivered via heartbeat. Activities that don't heartbeat won't know they've been cancelled. +2. **Checking for cancellation** - Activity must explicitly check for cancellation or await a cancellation signal. + +**Symptoms**: +- Cancelled activities running to completion +- Wasted compute on work that will be discarded +- Delayed workflow cancellation + +**The Fix**: Heartbeat regularly and check for cancellation. See language-specific gotchas for implementation patterns. + +## Payload Size Limits + +**The Problem**: Temporal has built-in limits on payload sizes. Exceeding them causes workflows to fail. 
**Limits**: +- Max 2MB per individual payload +- Max 4MB per gRPC message +- Max 50MB for complete workflow history (aim for <10MB in practice) + +**Symptoms**: +- Payload too large errors +- gRPC message size exceeded errors +- Workflow history growing unboundedly + +**The Fix**: Store large data externally (S3/GCS) and pass references, use compression codecs, or chunk data across multiple activities. See the Large Data Handling pattern in `references/core/patterns.md`. diff --git a/references/core/interactive-workflows.md b/references/core/interactive-workflows.md new file mode 100644 index 0000000..3b02028 --- /dev/null +++ b/references/core/interactive-workflows.md @@ -0,0 +1,49 @@ +# Interactive Workflows + +Interactive workflows are workflows that use Temporal features such as signals or updates to pause and wait for external input. When testing and debugging these types of workflows, you can send them input via the Temporal CLI. + +## Signals + +Fire-and-forget messages to a workflow. + +```bash +# Send signal to workflow +temporal workflow signal \ + --workflow-id <workflow-id> \ + --name "signal_name" \ + --input '{"key": "value"}' +``` + +## Updates + +Request-response style interaction (returns a value). + +```bash +# Send update to workflow +temporal workflow update execute \ + --workflow-id <workflow-id> \ + --name "update_name" \ + --input '{"approved": true}' +``` + +## Queries + +Read-only inspection of workflow state. + +```bash +# Query workflow state (read-only) +temporal workflow query \ + --workflow-id <workflow-id> \ + --name "get_status" +``` + +## Typical Steps for Testing Interactive Workflows + +```bash +# 1. Start worker (command is project dependent) +# 2. Start workflow (command is project dependent). This should output the workflow ID; if it doesn't, modify it to do so. +temporal workflow signal --workflow-id <workflow-id> --name "signal_name" --input '{"key": "value"}' # 3. Send interactive events, e.g. a signal. +# 4. Wait for workflow to complete (use Temporal CLI to check status) +# 5. 
Read the workflow result using the Temporal CLI
+# 6. Clean up the worker process if needed.
+```
diff --git a/references/core/patterns.md b/references/core/patterns.md
new file mode 100644
index 0000000..93f774d
--- /dev/null
+++ b/references/core/patterns.md
+# Temporal Workflow Patterns
+
+## Overview
+
+Common patterns for building robust Temporal workflows.
+See the language-specific references for the language you are working in:
+- `references/{language}/{language}.md` for the root-level documentation for that language
+- `references/{language}/patterns.md` for language-specific example code of the patterns in this file.
+
+## Signals
+
+**Purpose**: Send data to a running workflow asynchronously (fire-and-forget).
+
+**When to Use**:
+- Human approval workflows
+- Adding items to a workflow's queue
+- Notifying workflow of external events
+- Live configuration updates
+
+**Characteristics**:
+- Asynchronous - sender doesn't wait for response
+- Can mutate workflow state
+- Durable - signals are persisted in history
+- Can be sent before workflow starts (signal-with-start)
+
+**Example Flow**:
+```
+Client                     Workflow
+  │                          │
+  │──── signal(approve) ────▶│
+  │                          │ (updates state)
+  │                          │
+  │◀──── (no response) ──────│
+```
+
+**Note:** A related but distinct pattern to signals is async activity completion. This is an advanced feature worth considering if the external system that would deliver the signal is unreliable and might fail to signal, or if you want the external process to heartbeat or receive cancellation. If so, see the language-specific advanced features for your SDK language (`references/{your_language}/advanced-features.md`).
+
+## Queries
+
+**Purpose**: Read workflow state synchronously without modifying it.
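+
+As an illustrative sketch in Python SDK style (the handler and state names here are hypothetical), a query handler simply returns existing state:
+
+```
+@workflow.query
+def get_status(self) -> str:
+    # Read-only: return current state; never mutate state or block here
+    return self._status
+```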
+
+**When to Use**:
+- Building dashboards showing workflow progress
+- Health checks and monitoring
+- Debugging workflow state
+- Exposing current status to external systems
+
+**Characteristics**:
+- Synchronous - caller waits for response
+- Read-only - must not modify state
+- Not recorded in history
+- Executes on the worker, not persisted
+- Can run even on completed workflows
+
+**Example Flow**:
+```
+Client                     Workflow
+  │                          │
+  │──── query(status) ──────▶│
+  │                          │ (reads state)
+  │◀──── "processing" ───────│
+```
+
+## Updates
+
+**Purpose**: Modify workflow state and receive a response synchronously.
+
+**When to Use**:
+- Operations that need confirmation (add item, return count)
+- Validation before accepting changes
+- Replacing signal+query combinations
+- Request-response patterns within workflow
+
+**Characteristics**:
+- Synchronous - caller waits for completion
+- Can mutate state AND return values
+- Supports validators to reject invalid updates before they are even persisted into history
+- Recorded in history
+
+**Example Flow**:
+```
+Client                     Workflow
+  │                          │
+  │──── update(addItem) ────▶│
+  │                          │ (validates, modifies state)
+  │◀──── {count: 5} ─────────│
+```
+
+## Child Workflows
+
+**When to Use**:
+- Prevent history from growing too large
+- Isolate failure domains (child can fail without failing parent)
+- Different retry policies for different parts
+
+**Characteristics**:
+- Own history (doesn't bloat parent)
+- Independent lifecycle options (ParentClosePolicy)
+- Can be cancelled independently
+- Results returned to parent
+
+**Parent Close Policies**:
+- `TERMINATE` - Child terminated when parent closes (default)
+- `ABANDON` - Child continues running independently
+- `REQUEST_CANCEL` - Cancellation requested but not forced
+
+**Note:** You do not need child workflows simply to break complex logic into smaller pieces. Standard programming abstractions within a workflow already serve that purpose.
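+
+As an illustrative sketch in Python SDK style (the workflow and argument names here are hypothetical), a parent starts a child with an explicit parent close policy:
+
+```
+result = await workflow.execute_child_workflow(
+    ProcessBatchWorkflow.run,
+    batch,
+    id=f"process-batch-{batch_id}",
+    # ABANDON lets the child keep running after the parent closes
+    parent_close_policy=ParentClosePolicy.ABANDON,
+)
+```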
+ +## Continue-as-New + +**Purpose**: Prevent unbounded history growth by "restarting" with fresh history. + +**When to Use**: +- Long-running workflows (entity workflows, subscriptions) +- Workflows with many iterations +- When history approaches 10,000+ events +- Periodic cleanup of accumulated state + +**How It Works**: +``` +Workflow (history: 10,000 events) + │ + │ continueAsNew(currentState) + ▼ +New Workflow Execution (history: 0 events) + │ (same workflow ID, fresh history) + │ (receives currentState as input) +``` + +**Best Practice**: Check `historyLength` or `continueAsNewSuggested` periodically. + +## Saga Pattern + +**Purpose**: Distributed transactions with compensation for failures. + +**When to Use**: +- Multi-step operations that span services +- Operations requiring rollback on failure +- Financial transactions, order processing +- Booking systems with multiple reservations + +**How It Works**: +``` +Step 1: Reserve inventory + └─ Compensation: Release inventory + +Step 2: Charge payment + └─ Compensation: Refund payment + +Step 3: Ship order + └─ Compensation: Cancel shipment + +On failure at step 3: + Execute: Refund payment (step 2 compensation) + Execute: Release inventory (step 1 compensation) +``` + +**Implementation Pattern**: +1. Track compensation actions as you complete each step +2. On failure, execute compensations in reverse order +3. Handle compensation failures gracefully (log, alert, manual intervention) + +## Parallel Execution + +**Purpose**: Run multiple independent operations concurrently. + +**When to Use**: +- Processing multiple items that don't depend on each other +- Calling multiple APIs simultaneously +- Fan-out/fan-in patterns +- Reducing total workflow duration + +**Patterns**: +- `Promise` / `asyncio` - Use traditional concurrency helpers (e.g. 
wait for all, wait for first, etc.)
+- Partial failure handling - Continue with successful results
+
+## Entity Workflow Pattern
+
+**Purpose**: Model long-lived entities as workflows that handle events.
+
+**When to Use**:
+- Subscription management
+- User sessions
+- Shopping carts
+- Any stateful entity receiving events over time
+
+**How It Works**:
+```
+Entity Workflow (user-123)
+  │
+  ├── Receives signal: AddItem
+  │     └── Updates state
+  │
+  ├── Receives signal: UpdateQuantity
+  │     └── Updates state
+  │
+  ├── Receives query: GetCart
+  │     └── Returns current state
+  │
+  └── continueAsNew when history grows
+```
+
+## Timer Patterns
+
+**Purpose**: Durable delays that survive worker restarts.
+
+**Use Cases**:
+- Scheduled reminders
+- Timeout handling
+- Delayed actions
+- Polling with intervals
+
+**Characteristics**:
+- Timers are durable (persisted in history)
+- Can be cancelled
+
+## Polling Patterns
+
+### Frequent Polling
+
+**Purpose**: Frequently (once per second or faster) repeatedly check external state until a condition is met.
+
+**Implementation**:
+
+```
+# Inside Activity (polling_activity):
+while True:
+    result = await call_external_api()
+    if result.done:
+        break
+    activity.heartbeat("polling")
+    await sleep(poll_interval)
+
+
+# In workflow code:
+workflow.execute_activity(
+    polling_activity,
+    PollingActivityInput(...),
+    start_to_close_timeout=timedelta(seconds=60),
+    heartbeat_timeout=timedelta(seconds=2),
+)
+```
+
+To ensure that polling_activity is restarted in a timely manner, make sure it heartbeats on every iteration. Note that heartbeating only works if heartbeat_timeout is set to a shorter value than the activity's start_to_close_timeout.
+
+**Advantage:** Because the polling loop is inside the activity, this does not pollute the workflow history.
+
+### Infrequent Polling
+
+**Purpose**: Infrequently (once per minute or slower) repeatedly poll an external service.
+
+**Implementation**:
+
+Define an Activity which fails (raises an exception) exactly when polling is not completed.
+
+The polling loop is accomplished via activity retries, by setting the following Retry options:
+- backoff_coefficient: to 1
+- initial_interval: to the polling interval (e.g. 60 seconds)
+
+This enables the Activity to be retried exactly on the set interval.
+
+**Advantage:** Individual Activity retries are not recorded in Workflow History, so this approach can poll for a very long time without affecting the history size.
+
+## Idempotency Patterns
+
+**Purpose**: Ensure activities can be safely retried and replayed without causing duplicate side effects.
+
+**Why It Matters**: Temporal may re-execute activities during retries (on failure) or replay (on worker restart). Without idempotency, this can cause duplicate charges, duplicate emails, duplicate database entries, etc.
+
+### Using Idempotency Keys
+
+Pass a unique identifier to external services so they can detect and deduplicate repeated requests:
+
+```
+Activity: charge_payment(order_id, amount)
+  │
+  └── Call payment API with:
+        amount: $100
+        idempotency_key: "order-{order_id}"
+        │
+        └── Payment provider deduplicates based on key
+            (second call with same key returns original result)
+```
+
+**Good idempotency key sources**:
+- Workflow ID (unique per workflow execution)
+- Business identifier (order ID, transaction ID)
+- Workflow ID + activity name (stable across retries; avoid including the attempt number, which changes on every retry and defeats deduplication)
+
+### Check-Before-Act Pattern
+
+Query the external system's state before making changes:
+
+```
+Activity: send_welcome_email(user_id)
+  │
+  ├── Check: Has welcome email been sent for user_id?
+  │     │
+  │     ├── YES: Return early (already done)
+  │     │
+  │     └── NO: Send email, mark as sent
+```
+
+### Designing Idempotent Activities
+
+1. **Use unique identifiers** as idempotency keys with external APIs
+2. **Check before acting**: Query current state before making changes
+3. 
**Make operations repeatable**: Ensure calling twice produces the same result +4. **Record outcomes**: Store transaction IDs or results for verification +5. **Leverage external system features**: Many APIs (Stripe, AWS, etc.) have built-in idempotency key support + +### Tracking State in Workflows + +For complex multi-step operations, track completion status in workflow state: + +``` +Workflow State: + payment_completed: false + shipment_created: false + +Run: + if not payment_completed: + charge_payment(...) + payment_completed = true + + if not shipment_created: + create_shipment(...) + shipment_created = true +``` + +This ensures that on replay, already-completed steps are skipped. + +## Large Data Handling + +**Purpose**: Handle data that exceeds Temporal's payload limits without polluting workflow history. + +**Limits** (see `references/core/gotchas.md` for details): +- Max 2MB per individual payload +- Max 4MB per gRPC message +- Max 50MB for workflow history (aim for <10MB) + +**Key Principle**: Large data should never flow through workflow history. Activities read and write large data directly, passing only small references through the workflow. + +**Wrong Approach**: +``` +Workflow + │ + ├── downloadFromStorage(ref) ──▶ returns large data (enters history) + │ + ├── processData(largeData) ────▶ large data as argument (enters history AGAIN) + │ + └── uploadToStorage(result) ───▶ large data as argument (enters history AGAIN) +``` + +This defeats the purpose—large data enters workflow history multiple times. + +**Correct Approach**: +``` +Workflow + │ + └── processLargeData(inputRef) ──▶ returns outputRef (small string) + │ + └── Activity internally: + download(inputRef) → process → upload → return outputRef +``` + +The workflow only handles references (small strings). The activity does all large data operations internally. + +**Implementation Pattern**: +1. Accept a reference (URL, S3 key, database ID) as activity input +2. 
Download/fetch the large data inside the activity +3. Process the data inside the activity +4. Upload/store the result inside the activity +5. Return only a reference to the result + +**Other Strategies**: +- **Compression**: Use a PayloadCodec to compress data automatically +- **Chunking**: Split large collections across multiple activities, each handling a subset + +## Activity Heartbeating + +**Purpose**: Enable cancellation delivery and progress tracking for long-running activities. + +**Why Heartbeat**: +1. **Support activity cancellation** - Cancellations are delivered to activities via heartbeat. Activities that don't heartbeat won't know they've been cancelled. +2. **Resume progress after failure** - Heartbeat details persist across retries, allowing activities to resume where they left off. +3. **Detect stuck activities** - If an activity stops heartbeating, Temporal can time it out and retry. + +**How Cancellation Works**: +``` +Workflow requests activity cancellation + │ + ▼ +Temporal Service marks activity for cancellation + │ + ▼ +Activity calls heartbeat() + │ + ├── Not cancelled: heartbeat succeeds, continues + │ + └── Cancelled: heartbeat raises exception + Activity can catch this to perform cleanup +``` + +**Key Point**: If an activity never heartbeats, it will run to completion even if cancelled—it has no way to learn about the cancellation. + +## Local Activities + +**Purpose**: Reduce latency for short, lightweight operations by skipping the task queue. ONLY use these when necessary for performance. Do NOT use these by default, as they are not durable and distributed. 
+ +**When to Use**: +- Short operations completing in milliseconds/seconds +- High-frequency calls where task queue overhead is significant +- Low-latency requirements where you can't afford task queue round-trip + +**Characteristics**: +- Executes on the same worker that runs the workflow +- No task queue round-trip (lower latency) +- Still recorded in history +- Should complete quickly (default timeout is short) + +**Trade-offs**: +- Less visibility in Temporal UI (no separate task) +- Must complete on the same worker +- Not suitable for long-running operations + +## Choosing Between Patterns + +| Need | Pattern | +|------|---------| +| Send data, don't need response | Signal | +| Read state, no modification | Query | +| Modify state, need response | Update | +| Break down large workflow | Child Workflow | +| Prevent history growth | Continue-as-New | +| Rollback on failure | Saga | +| Process items concurrently | Parallel Execution | +| Long-lived stateful entity | Entity Workflow | +| Safe retries/replays | Idempotency | +| Low-latency short operations | Local Activities | diff --git a/references/core/troubleshooting.md b/references/core/troubleshooting.md new file mode 100644 index 0000000..e4ef2cb --- /dev/null +++ b/references/core/troubleshooting.md @@ -0,0 +1,323 @@ +# Temporal Troubleshooting Guide + +## Workflow Diagnosis Decision Tree + +``` +Workflow not behaving as expected? +│ +├─▶ What is the workflow status? +│ │ +│ ├─▶ RUNNING (but no progress) +│ │ └─▶ Go to: "Workflow Stuck" section +│ │ +│ ├─▶ FAILED +│ │ └─▶ Go to: "Workflow Failed" section +│ │ +│ ├─▶ TIMED_OUT +│ │ └─▶ Go to: "Timeout Issues" section +│ │ +│ └─▶ COMPLETED (but wrong result) +│ └─▶ Go to: "Wrong Result" section +``` + +## Workflow Stuck (RUNNING but No Progress) + +### Decision Tree + +``` +Workflow stuck in RUNNING? +│ +├─▶ Is a worker running? +│ │ +│ ├─▶ NO: Start a worker +│ │ └─▶ See references/core/dev-management.md +│ │ +│ └─▶ YES: Is it on the correct task queue? 
+│       │
+│       ├─▶ NO: Start worker with correct task queue
+│       │
+│       └─▶ YES: Check for non-determinism
+│           │
+│           ├─▶ NondeterminismError in logs?
+│           │   └─▶ Go to: "Non-Determinism" section
+│           │
+│           ├─▶ Check history for task failures
+│           │   └─▶ Run: `temporal workflow show --workflow-id <workflow-id>`
+│           │       │
+│           │       ├─▶ WorkflowTaskFailed event?
+│           │       │   └─▶ Check error type in event details
+│           │       │       └─▶ Go to relevant section in error-reference.md
+│           │       │
+│           │       └─▶ ActivityTaskFailed event?
+│           │           └─▶ Go to: "Activity Keeps Retrying" section
+│           │
+│           └─▶ No errors in logs or history?
+│               └─▶ Check if workflow is waiting for signal/timer
+```
+
+### Common Causes
+
+1. **No worker running**
+   - See references/core/dev-management.md
+
+2. **Worker on wrong task queue**
+   - Check: Worker logs for task queue name
+   - Fix: Start worker with matching task queue
+
+3. **Worker has stale code**
+   - Check: Worker startup time vs code changes
+   - Fix: Restart worker with updated code
+
+4. **Workflow waiting for signal**
+   - Check: Workflow history for pending signals
+   - Fix: Send expected signal or check signal sender
+
+5. **Activity stuck/timing out**
+   - Check: Activity retry attempts in history
+   - Fix: Investigate activity failure, increase timeout
+
+## Non-Determinism Errors
+
+### Decision Tree
+
+```
+NondeterminismError?
+│
+├─▶ Was code intentionally changed?
+│   │
+│   ├─▶ YES: Do you need to support in-flight workflows?
+│   │   │
+│   │   ├─▶ YES (production): Use patching API
+│   │   │   └─▶ See: references/core/versioning.md
+│   │   │
+│   │   └─▶ NO (local dev/testing): Terminate or reset workflow
+│   │       └─▶ `temporal workflow terminate --workflow-id <workflow-id>`
+│   │           └─▶ Then start fresh with new code
+│   │
+│   └─▶ NO: Accidental change
+│       │
+│       ├─▶ Can you identify the change?
+│       │   │
+│       │   ├─▶ YES: Revert and restart worker. Note: this doesn't always work if the workflow has progressed past the change (it may induce other code paths), so you may need to reset the workflow.
+│ │ │ +│ │ └─▶ NO: Compare current code to expected history +│ │ └─▶ Check: Activity names, order, parameters +``` + +### Common Causes + +1. **Changed call order** + ``` + # Before # After (BREAKS) + await activity_a await activity_b + await activity_b await activity_a + ``` + +2. **Changed call name** + ``` + # Before # After (BREAKS) + await process_order(...) await handle_order(...) + ``` + +3. **Added/removed call** + - Adding new activity mid-workflow + - Removing activity that was previously called + +4. **Using non-deterministic code** + - `datetime.now()` in workflow (use `workflow.now()`) + - `random.random()` in workflow (use `workflow.random()`) + +### Recovery + +**Accidental Change:** +1. Identify the change +2. Revert code to match history +3. Restart worker +4. Workflow automatically recovers + +**Intentional Change:** +1. Use patching API for gradual migration +2. Or terminate old workflows, start new ones + +## Workflow Failed + +### Decision Tree + +``` +Workflow status = FAILED? +│ +├─▶ Check workflow error message +│ │ +│ ├─▶ Application error (your code) +│ │ └─▶ Fix the bug, start new workflow +│ │ +│ ├─▶ NondeterminismError +│ │ └─▶ Go to: "Non-Determinism" section +│ │ +│ └─▶ Timeout error +│ └─▶ Go to: "Timeout Issues" section +``` + +### Common Causes + +1. **Unhandled exception in workflow** + - Check error message and stack trace + - Fix bug in workflow code + +2. **Activity exhausted retries** + - All retry attempts failed + - Check activity logs for root cause + +3. 
**Non-retryable error thrown**
+   - Error marked as non-retryable
+   - Intentional failure, check business logic
+
+## Timeout Issues
+
+### Timeout Types
+
+| Timeout | Scope | What It Limits |
+|---------|-------|----------------|
+| `WorkflowExecutionTimeout` | Entire workflow | Total time including retries and continue-as-new |
+| `WorkflowRunTimeout` | Single run | Time for one run (before continue-as-new) |
+| `ScheduleToCloseTimeout` | Activity | Total time including retries |
+| `StartToCloseTimeout` | Activity | Single attempt time |
+| `HeartbeatTimeout` | Activity | Time between heartbeats |
+
+### Diagnosis
+
+```
+Timeout error?
+│
+├─▶ Which timeout?
+│   │
+│   ├─▶ Workflow timeout
+│   │   └─▶ Increase timeout or optimize workflow. Better yet, consider removing the workflow timeout, as it is generally discouraged unless *necessary* for your use case.
+│   │
+│   ├─▶ ScheduleToCloseTimeout
+│   │   └─▶ Activity taking too long overall (including retries)
+│   │
+│   ├─▶ StartToCloseTimeout
+│   │   └─▶ Single activity attempt too slow
+│   │
+│   └─▶ HeartbeatTimeout
+│       └─▶ Activity not heartbeating frequently enough
+│           └─▶ Add heartbeat() calls in long activities
+```
+
+### Fixes
+
+1. **Increase timeout** if operation legitimately takes longer
+2. **Add heartbeats** to long-running activities
+3. **Optimize activity** to complete faster
+4. **Break into smaller activities** for better granularity
+
+## Activity Keeps Retrying
+
+### Decision Tree
+
+```
+Activity retrying repeatedly?
+│
+├─▶ Check activity error
+│   │
+│   ├─▶ Transient error (network, timeout)
+│   │   └─▶ Expected behavior, will eventually succeed
+│   │
+│   ├─▶ Permanent error (bug, invalid input)
+│   │   └─▶ Fix the bug or mark as non-retryable
+│   │
+│   └─▶ Resource exhausted
+│       └─▶ Add backoff, check rate limits
+```
+
+### Common Causes
+
+1. **Bug in activity code**
+   - Fix the bug
+   - Consider marking certain errors as non-retryable
+
+2. 
**External service down**
+   - Retries are working as intended
+   - Monitor service recovery
+
+3. **Invalid input**
+   - Validate inputs before activity
+   - Return non-retryable error for bad input
+
+## Wrong Result (Completed but Incorrect)
+
+### Diagnosis
+
+1. **Check workflow history** for unexpected activity results
+2. **Verify activity implementations** produce correct output
+3. **Check for race conditions** in parallel execution
+4. **Verify signal handling** if signals are involved
+
+### Common Causes
+
+1. **Activity bug** - Wrong logic in activity
+2. **Stale data** - Activity using outdated information
+3. **Signal ordering** - Signals processed in unexpected order
+4. **Parallel execution** - Race condition in concurrent operations
+
+## Worker Issues
+
+### Worker Not Starting
+
+```
+Worker won't start?
+│
+├─▶ Connection error
+│   └─▶ Check Temporal server is running
+│       └─▶ `temporal server start-dev` (start in background, see references/core/dev-management.md)
+│
+├─▶ Registration error
+│   └─▶ Check workflow/activity definitions are valid
+│
+└─▶ Other errors (imports, etc.)
+    └─▶ Debug those errors as usual.
+```
+
+### Worker Crashing
+
+1. **Out of memory** - Reduce concurrent tasks, check for leaks
+2. **Unhandled exception** - Add error handling
+3. **Dependency issue** - Check package versions
+
+## Useful Commands
+
+```bash
+# Check Temporal server
+temporal server start-dev
+
+# List workflows
+temporal workflow list
+
+# Describe specific workflow
+temporal workflow describe --workflow-id <workflow-id>
+
+# Show workflow history
+temporal workflow show --workflow-id <workflow-id>
+
+# Terminate stuck workflow
+temporal workflow terminate --workflow-id <workflow-id>
+
+# Reset workflow to specific point
+temporal workflow reset --workflow-id <workflow-id> --event-id <event-id>
+```
+
+## Quick Reference: Status → Action
+
+| Status | First Check | Common Fix |
+|--------|-------------|------------|
+| RUNNING (stuck) | Worker running? 
| Start/restart worker | +| FAILED | Error message | Fix bug, handle error | +| TIMED_OUT | Which timeout? | Increase timeout or optimize | +| TERMINATED | Who terminated? | Check audit log | +| CANCELED | Cancellation source | Expected or investigate | + +## See Also + +- [Common Gotchas](gotchas.md) - Anti-patterns that cause these issues +- [Error Reference](error-reference.md) - Quick error type lookup diff --git a/references/core/versioning.md b/references/core/versioning.md new file mode 100644 index 0000000..226bb83 --- /dev/null +++ b/references/core/versioning.md @@ -0,0 +1,174 @@ +# Workflow Versioning Concepts + +This document provides core conceptual explanations of workflow versioning in Temporal. For language-specific implementation details see `references/{your_language}/versioning.md`, for the language you are working in. + +## Overview + +Workflow versioning allows safe deployment of code changes without breaking running workflows. Three approaches available: + +1. **Patching API** - Code-level version branching +2. **Workflow Type Versioning** - New workflow types for incompatible changes +3. **Worker Versioning** - Deployment-level control with Build IDs + +## Why Versioning is Needed + +When workers restart after deployment, they resume open workflows through history replay. If updated code produces different Commands than the original code, it causes non-determinism errors. + +``` +Original Code (recorded in history): + await activity_a() + await activity_b() + +Updated Code (during replay): + await activity_a() + await activity_c() ← Different! NondeterminismError +``` + +## Approach 1: Patching API + +### Concept + +The patching API lets you branch code based on whether a workflow was started before or after a code change. 
+ +``` +if patched("my-change"): + // New code path (for new and replaying new workflows) +else: + // Old code path (for replaying old workflows) +``` + +### Three-Phase Lifecycle + +**Phase 1: Patch In** +- Add both old and new code paths +- New workflows take new path, old workflows take old path + +**Phase 2: Deprecate** +- After all old workflows complete, remove old code +- Keep deprecation marker for history compatibility + +**Phase 3: Remove** +- After all deprecated workflows complete +- Remove patch entirely, only new code remains + +### When to Use + +- Adding, removing, or reordering activities/child workflows +- Changing which activity/child workflow is called +- Any change that alters the Command sequence + +### When NOT to Use + +- Changing activity implementations (activities aren't replayed) +- Changing arguments passed to activities or child workflows +- Changing retry policies +- Changing timer durations +- Adding new signal/query/update handlers (additive changes are safe) +- Bug fixes that don't change Command sequence + +Unnecessary patching adds complexity and can make workflow code unmanageable. + +## Approach 2: Workflow Type Versioning + +### Concept + +Create a new workflow type (e.g., `OrderWorkflowV2`) instead of patching. + +``` +// Old: OrderWorkflow +// New: OrderWorkflowV2 (completely new implementation) +``` + +### When to Use + +- Major incompatible changes +- Complete rewrites +- When patching would be too complex +- When you want clean separation + +### Process + +1. Create new workflow type with new name +2. Register both with worker +3. Start new workflows with new type +4. Wait for old workflows to complete +5. Remove old workflow type + +## Approach 3: Worker Versioning + +### Concept + +Manage versions at deployment level using Build IDs. Multiple worker versions can run simultaneously. 
+ +``` +Worker v1.0 (Build ID: abc123) + └── Handles workflows started on this version + +Worker v2.0 (Build ID: def456) + └── Handles new workflows + └── Can also handle upgraded old workflows +``` + +### Key Concepts + +**Worker Deployment**: Logical service grouping (e.g., "order-service") + +**Build ID**: Specific code version (e.g., git commit hash) + +**Versioning Behaviors**: +- `PINNED` - Workflows stay on original worker version +- `AUTO_UPGRADE` - Workflows can move to newer versions + +### When to Use PINNED + +- Short-running workflows (minutes to hours) +- Consistency is critical +- Want simplest development experience +- Building new applications + +### When to Use AUTO_UPGRADE + +- Long-running workflows (weeks or months) +- Workflows need bug fixes during execution +- Still requires patching for version transitions + +## Choosing an Approach + +| Scenario | Recommended Approach | +|----------|---------------------| +| Small change, few running workflows | Patching API | +| Major rewrite | Workflow Type Versioning | +| Many short workflows, frequent deploys | Worker Versioning (PINNED) | +| Long-running workflows needing updates | Worker Versioning (AUTO_UPGRADE) + Patching | +| Quick fix, can wait for completion | Wait for workflows to complete | + +## Best Practices + +1. **Check for open executions** before removing old code +2. **Use descriptive patch IDs** (e.g., "add-fraud-check" not "patch-1") +3. **Deploy incrementally**: patch → deprecate → remove +4. **Test replay compatibility** before deploying changes +5. 
**Monitor old workflow counts** during migration
+
+## Finding Workflows by Version
+
+```bash
+# Find workflows with specific patch
+temporal workflow list --query \
+  'WorkflowType = "OrderWorkflow" AND TemporalChangeVersion = "add-fraud-check"'
+
+# Find pre-patch workflows
+temporal workflow list --query \
+  'WorkflowType = "OrderWorkflow" AND TemporalChangeVersion IS NULL'
+
+# Find workflows on specific worker version
+temporal workflow list --query \
+  'TemporalWorkerDeploymentVersion = "my-service:v1.0.0"'
+```
+
+## Common Mistakes
+
+1. **Removing old code too early** - Breaks replaying workflows
+2. **Not testing with replay** - Misses issues that replay testing would catch before production
+3. **Patching non-Command changes** - Unnecessary complexity
+4. **Forgetting to deprecate** - Accumulates dead code
diff --git a/references/python/advanced-features.md b/references/python/advanced-features.md
new file mode 100644
index 0000000..e0d3297
--- /dev/null
+++ b/references/python/advanced-features.md
+# Python SDK Advanced Features
+
+## Schedules
+
+Create recurring workflow executions.
+
+```python
+from datetime import timedelta
+
+from temporalio.client import (
+    Schedule,
+    ScheduleActionStartWorkflow,
+    ScheduleSpec,
+    ScheduleIntervalSpec,
+)
+
+# Create a schedule
+schedule_id = "daily-report"
+await client.create_schedule(
+    schedule_id,
+    Schedule(
+        action=ScheduleActionStartWorkflow(
+            DailyReportWorkflow.run,
+            id="daily-report",
+            task_queue="reports",
+        ),
+        spec=ScheduleSpec(
+            intervals=[ScheduleIntervalSpec(every=timedelta(days=1))],
+        ),
+    ),
+)
+
+# Manage schedules
+schedule = client.get_schedule_handle(schedule_id)
+await schedule.pause(note="Maintenance window")
+await schedule.unpause()
+await schedule.trigger()  # Run immediately
+await schedule.delete()
+```
+
+## Async Activity Completion
+
+For activities that complete asynchronously (e.g., human tasks, external callbacks).
+If you configure a heartbeat_timeout on this activity, the external completer is responsible for sending heartbeats via the async handle.
+If you do NOT set a heartbeat_timeout, no heartbeats are required.
+
+**Note:** If the external system that completes the asynchronous action can reliably be trusted to do the task and signal back with the result, and it doesn't need to heartbeat or receive cancellation, then consider using **signals** instead.
+
+```python
+from temporalio import activity
+from temporalio.client import Client
+from temporalio.exceptions import ApplicationError
+
+@activity.defn
+async def request_approval(request_id: str) -> None:
+    # Get task token for async completion
+    task_token = activity.info().task_token
+
+    # Store task token for later completion (e.g., in database)
+    await store_task_token(request_id, task_token)
+
+    # Mark this activity as waiting for external completion
+    activity.raise_complete_async()
+
+# Later, complete the activity from another process
+async def complete_approval(request_id: str, approved: bool):
+    client = await Client.connect("localhost:7233", namespace="default")
+    task_token = await get_task_token(request_id)
+
+    handle = client.get_async_activity_handle(task_token=task_token)
+
+    # Optional: if a heartbeat_timeout was set, you can periodically:
+    # await handle.heartbeat(progress_details)
+
+    if approved:
+        await handle.complete("approved")
+    else:
+        # You can also fail or report cancellation via the handle
+        await handle.fail(ApplicationError("Rejected"))
+```
+
+## Sandbox Customization
+
+The Python SDK runs workflows in a sandbox to help you ensure determinism. You can customize sandbox restrictions when needed. See `references/python/determinism-protection.md`.
+
+## Gevent Compatibility Warning
+
+**The Python SDK is NOT compatible with gevent.** Gevent's monkey patching modifies Python's asyncio event loop in ways that break the SDK's deterministic execution model.
+If your application uses gevent:
+- You cannot run Temporal workers in the same process
+- Consider running workers in a separate process without gevent
+- Use a message queue or HTTP API to communicate between gevent and Temporal processes
+
+## Worker Tuning
+
+Configure worker performance settings.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+from datetime import timedelta
+
+from temporalio.worker import Worker
+
+worker = Worker(
+    client,
+    task_queue="my-queue",
+    workflows=[MyWorkflow],
+    activities=[my_activity],
+    # Workflow task concurrency
+    max_concurrent_workflow_tasks=100,
+    # Activity task concurrency
+    max_concurrent_activities=100,
+    # Executor for sync activities
+    activity_executor=ThreadPoolExecutor(max_workers=50),
+    # Graceful shutdown timeout
+    graceful_shutdown_timeout=timedelta(seconds=30),
+)
+```
+
+## Workflow Init Decorator
+
+Use `@workflow.init` to run initialization code when a workflow instance is created, with access to the workflow's input arguments.
+
+**Purpose:** Execute setup code before any signal or update handler, or the run method, is invoked.
+
+```python
+@workflow.defn
+class MyWorkflow:
+    @workflow.init
+    def __init__(self, initial_value: str) -> None:
+        # Receives the same arguments as the run method, and runs
+        # before any signal/update handler can touch workflow state
+        self._value = initial_value
+        self._items: list[str] = []
+
+    @workflow.run
+    async def run(self, initial_value: str) -> str:
+        # self._value and self._items are already initialized
+        return self._value
+```
+
+## Workflow Failure Exception Types
+
+Control which exceptions cause workflow task failures vs workflow failures.
+
+- Special case: if you include temporalio.workflow.NondeterminismError (or a superclass), non-determinism errors will fail the workflow instead of leaving it in a retrying state
+- **Tip for testing:** Set to `[Exception]` in tests so any unhandled exception fails the workflow immediately rather than retrying the workflow task forever. This surfaces bugs faster.
+ +### Per-Workflow Configuration + +```python +@workflow.defn( + # These exception types will fail the workflow execution (not just the task) + failure_exception_types=[ValueError, CustomBusinessError] +) +class MyWorkflow: + @workflow.run + async def run(self) -> str: + raise ValueError("This fails the workflow, not just the task") +``` + +### Worker-Level Configuration + +```python +worker = Worker( + client, + task_queue="my-queue", + workflows=[MyWorkflow], + workflow_failure_exception_types=[ValueError, CustomBusinessError], +) +``` + diff --git a/references/python/ai-patterns.md b/references/python/ai-patterns.md new file mode 100644 index 0000000..a07e30a --- /dev/null +++ b/references/python/ai-patterns.md @@ -0,0 +1,334 @@ +# Python AI/LLM Integration Patterns + +## Overview + +This document provides Python-specific implementation details for integrating LLMs with Temporal. For conceptual patterns, see `references/core/ai-integration.md`. + +## Pydantic Data Converter Setup + +**Required** for handling complex types like OpenAI response objects: + +```python +from temporalio.client import Client +from temporalio.contrib.pydantic import pydantic_data_converter + +client = await Client.connect( + "localhost:7233", + namespace="default", + data_converter=pydantic_data_converter, +) +``` + +## OpenAI Client Configuration + +**Critical**: Disable client retries, let Temporal handle them: + +```python +from openai import AsyncOpenAI + +openai_client = AsyncOpenAI( + api_key=os.getenv("OPENAI_API_KEY"), + max_retries=0, # CRITICAL: Disable client retries + timeout=30.0, +) +``` + +## LiteLLM Configuration + +For multi-model support: + +```python +import litellm + +litellm.num_retries = 0 # Disable LiteLLM retries +``` + +## Generic LLM Activity + +Flexible, reusable activity for LLM calls: + +```python +import openai +from temporalio import activity +from temporalio.exceptions import ApplicationError +from pydantic import BaseModel +from typing import Optional, 
Any + +class LLMRequest(BaseModel): + model: str + system_prompt: str + user_input: str + tools: Optional[list] = None + response_format: Optional[type] = None + temperature: float = 0.7 + +class LLMResponse(BaseModel): + content: str + tool_calls: Optional[list] = None + usage: dict + +@activity.defn +async def call_llm(request: LLMRequest) -> LLMResponse: + """Generic LLM activity supporting multiple use cases.""" + try: + # As an example, calling OpenAI. This could be any chat API you wish though... + response = await openai_client.chat.completions.create( + model=request.model, + messages=[ + {"role": "system", "content": request.system_prompt}, + {"role": "user", "content": request.user_input}, + ], + tools=request.tools, + temperature=request.temperature, + ) + return LLMResponse( + content=response.choices[0].message.content or "", + tool_calls=response.choices[0].message.tool_calls, + usage=response.usage.model_dump(), + ) + + # Some example error cases to handle. These are not necessarily exhaustive, and depend on the API you are actually calling! + except openai.AuthenticationError as e: + # Invalid API key - permanent failure, don't retry + raise ApplicationError( + f"Invalid API key: {e}", + type="AuthenticationError", + non_retryable=True, + ) + + except openai.RateLimitError as e: + # Rate limited - transient, let Temporal retry with backoff + raise ApplicationError( + f"Rate limited: {e}", + type="RateLimitError", + next_retry_delay=... # parse this from headers + ) + + except openai.APIStatusError as e: + if e.status_code >= 500: + # Server error - transient, retry + raise ApplicationError( + f"OpenAI server error ({e.status_code}): {e}", + type="ServerError", + ) + else: + # Other client errors (400, etc.) 
- likely permanent
+            raise ApplicationError(
+                f"OpenAI client error ({e.status_code}): {e}",
+                type="ClientError",
+                non_retryable=True,
+            )
+
+    except openai.APIConnectionError as e:
+        # Network error - transient, retry
+        raise ApplicationError(
+            f"Connection error: {e}",
+            type="ConnectionError",
+        )
+```
+
+## Activity Retry Policy
+
+Configure retries at the workflow level:
+
+```python
+from datetime import timedelta
+from temporalio import workflow
+from temporalio.common import RetryPolicy
+
+with workflow.unsafe.imports_passed_through():
+    from activities.llm import call_llm, LLMRequest
+
+@workflow.defn
+class LLMWorkflow:
+    @workflow.run
+    async def run(self, prompt: str) -> str:
+        # Because call_llm classifies exceptions as retryable or non-retryable,
+        # we automatically get correct retry behavior just by calling it.
+        response = await workflow.execute_activity(
+            call_llm,
+            LLMRequest(
+                model="gpt-4",
+                system_prompt="You are a helpful assistant.",
+                user_input=prompt,
+            ),
+            start_to_close_timeout=timedelta(seconds=30),
+        )
+        return response.content
+```
+
+## Tool-Calling Agent Workflow
+
+```python
+from temporalio import workflow
+from datetime import timedelta
+from pydantic import BaseModel
+
+with workflow.unsafe.imports_passed_through():
+    from activities.llm import call_llm, LLMRequest, LLMResponse
+    from activities.tools import execute_tool
+    from models.tools import ToolDefinition
+
+class AgentWorkflowInput(BaseModel):
+    user_request: str
+    tools: list[ToolDefinition]
+
+@workflow.defn
+class AgentWorkflow:
+    @workflow.run
+    async def run(self, input: AgentWorkflowInput) -> str:
+        messages = []
+        current_input = input.user_request
+
+        while True:
+            # Phase 1: Get LLM response with tools
+            response = await workflow.execute_activity(
+                call_llm,
+                LLMRequest(
+                    model="gpt-4",
+                    system_prompt="You are a helpful agent with tools.",
+                    user_input=current_input,
+                    tools=[t.to_openai_format() for t in input.tools],
+                
), + start_to_close_timeout=timedelta(seconds=30), + ) + + # Check if LLM wants to use a tool + if not response.tool_calls: + return response.content + + # Phase 2: Execute tools + for tool_call in response.tool_calls: + tool_result = await workflow.execute_activity( + execute_tool, + tool_call, + start_to_close_timeout=timedelta(seconds=60), + ) + messages.append({ + "role": "tool", + "tool_call_id": tool_call.id, + "content": tool_result, + }) + + # Phase 3: Continue conversation with tool results + current_input = f"Tool results: {messages}" +``` + +## Structured Outputs + +Using Pydantic for validated responses: + +```python +from pydantic import BaseModel +from temporalio import activity + +class AnalysisResult(BaseModel): + sentiment: str + confidence: float + key_topics: list[str] + summary: str + +@activity.defn +async def analyze_text(text: str) -> AnalysisResult: + response = await openai_client.beta.chat.completions.parse( + model="gpt-4o", + messages=[ + {"role": "system", "content": "Analyze the following text."}, + {"role": "user", "content": text}, + ], + response_format=AnalysisResult, + ) + return response.choices[0].message.parsed +``` + +## Multi-Agent Pipeline (Deep Research) + +```python +from temporalio import workflow +from datetime import timedelta +import asyncio + +with workflow.unsafe.imports_passed_through(): + from activities.research import ( + generate_subtopics, + generate_search_queries, + search_web, + synthesize_report, + ) + +@workflow.defn +class DeepResearchWorkflow: + @workflow.run + async def run(self, topic: str) -> str: + # Phase 1: Planning + subtopics = await workflow.execute_activity( + generate_subtopics, + topic, + start_to_close_timeout=timedelta(seconds=60), + ) + + # Phase 2: Query Generation + queries = await workflow.execute_activity( + generate_search_queries, + subtopics, + start_to_close_timeout=timedelta(seconds=60), + ) + + # Phase 3: Parallel Web Search (resilient to partial failures) + search_tasks = [ + 
workflow.execute_activity(
+                search_web,
+                query,
+                start_to_close_timeout=timedelta(seconds=300),
+                # A schedule-to-close timeout bounds total retry time, so one
+                # repeatedly failing search cannot hang the gather step below.
+                schedule_to_close_timeout=timedelta(seconds=900),
+            )
+            for query in queries
+        ]
+
+        # Continue with partial results on failure
+        results = await asyncio.gather(*search_tasks, return_exceptions=True)
+        successful_results = [r for r in results if not isinstance(r, Exception)]
+
+        # Phase 4: Synthesis
+        report = await workflow.execute_activity(
+            synthesize_report,
+            {"topic": topic, "research": successful_results},
+            start_to_close_timeout=timedelta(seconds=300),
+        )
+
+        return report
+```
+
+## OpenAI Agents SDK Integration
+
+If you build agents with the OpenAI Agents SDK, use Temporal's `temporalio.contrib.openai_agents` module. Its `OpenAIAgentsPlugin`, configured on both the client and the worker, makes agents durable by routing model calls through activities:
+
+```python
+from agents import Agent, Runner
+from temporalio import workflow
+
+@workflow.defn
+class DurableAgentWorkflow:
+    @workflow.run
+    async def run(self, task: str) -> str:
+        # A regular Agents SDK agent, defined inside the workflow
+        agent = Agent(
+            name="Assistant",
+            instructions="You are a helpful assistant.",
+            tools=[search_tool, calculator_tool],
+        )
+        # With the plugin installed, each model invocation is
+        # automatically dispatched to an activity under the hood.
+        result = await Runner.run(agent, input=task)
+        return result.final_output
+```
+
+## Best Practices
+
+1. **Always use Pydantic data converter** for complex types
+2. **Disable retries in LLM clients** (max_retries=0)
+3. **Set appropriate timeouts** per operation type
+4. **Use structured outputs** for type safety
+5. **Handle partial failures** in parallel operations
+6. **Mock activities in tests** for fast, deterministic testing
+7. **Log token usage** for cost tracking
+8. 
**Version prompts** in code for reproducibility
diff --git a/references/python/data-handling.md b/references/python/data-handling.md
new file mode 100644
index 0000000..662101e
--- /dev/null
+++ b/references/python/data-handling.md
@@ -0,0 +1,230 @@
+# Python SDK Data Handling
+
+## Overview
+
+The Python SDK uses data converters to serialize/deserialize workflow inputs, outputs, and activity parameters.
+
+## Default Data Converter
+
+The default converter handles:
+- `None`
+- `bytes` (as binary)
+- Protobuf messages
+- JSON-serializable types (dict, list, str, int, float, bool)
+
+## Pydantic Integration
+
+Use Pydantic models for validated, typed data.
+
+In your workflow definition, just use input and result types that subclass `pydantic.BaseModel`:
+
+```python
+from pydantic import BaseModel
+
+from temporalio import workflow
+
+class OrderInput(BaseModel):
+    order_id: str
+    items: list[str]
+    total: float
+    customer_email: str
+
+class OrderResult(BaseModel):
+    order_id: str
+    status: str
+    tracking_number: str | None = None
+
+@workflow.defn
+class OrderWorkflow:
+    @workflow.run
+    async def run(self, input: OrderInput) -> OrderResult:
+        # Pydantic validation happens automatically
+        return OrderResult(
+            order_id=input.order_id,
+            status="completed",
+            tracking_number="TRK123",
+        )
+```
+
+And when you configure the client, pass the `pydantic_data_converter`:
+
+```python
+from temporalio.client import Client
+from temporalio.contrib.pydantic import pydantic_data_converter
+
+# Configure client with Pydantic support
+client = await Client.connect(
+    "localhost:7233",
+    namespace="default",
+    data_converter=pydantic_data_converter,
+)
+```
+
+## Custom Data Conversion
+
+The easiest way is usually to implement an `EncodingPayloadConverter` and register it in a `CompositePayloadConverter`. For an extended example, see:
+- https://raw.githubusercontent.com/temporalio/samples-python/refs/heads/main/custom_converter/shared.py
+- https://raw.githubusercontent.com/temporalio/samples-python/refs/heads/main/custom_converter/starter.py
+
+## Payload Encryption
+
+Encrypt sensitive workflow data.
+
+```python
+import asyncio
+from typing import Sequence
+
+from cryptography.fernet import Fernet
+from temporalio.api.common.v1 import Payload
+from temporalio.client import Client
+from temporalio.converter import DataConverter, PayloadCodec
+
+class EncryptionCodec(PayloadCodec):
+    def __init__(self, key: bytes):
+        self._fernet = Fernet(key)
+
+    async def encode(self, payloads: Sequence[Payload]) -> list[Payload]:
+        return [
+            Payload(
+                metadata={"encoding": b"binary/encrypted"},
+                # Fernet's crypto runs in C extensions that release the GIL,
+                # so offload it to a thread to avoid blocking the event loop.
+                data=await asyncio.to_thread(self._fernet.encrypt, p.SerializeToString()),
+            )
+            for p in payloads
+        ]
+
+    async def decode(self, payloads: Sequence[Payload]) -> list[Payload]:
+        result = []
+        for p in payloads:
+            if p.metadata.get("encoding") == b"binary/encrypted":
+                decrypted = await asyncio.to_thread(self._fernet.decrypt, p.data)
+                decoded = Payload()
+                decoded.ParseFromString(decrypted)
+                result.append(decoded)
+            else:
+                result.append(p)
+        return result
+
+# Apply encryption codec (encryption_key is a Fernet key from your secret store)
+client = await Client.connect(
+    "localhost:7233",
+    namespace="default",
+    data_converter=DataConverter(
+        payload_codec=EncryptionCodec(encryption_key),
+    ),
+)
+```
+
+## Search Attributes
+
+Custom searchable fields for workflow visibility. 
These can be created at workflow start: + +```python +from temporalio.common import ( + SearchAttributeKey, + SearchAttributePair, + TypedSearchAttributes, +) +from datetime import datetime +from datetime import timezone + +ORDER_ID = SearchAttributeKey.for_keyword("OrderId") +ORDER_STATUS = SearchAttributeKey.for_keyword("OrderStatus") +ORDER_TOTAL = SearchAttributeKey.for_float("OrderTotal") +CREATED_AT = SearchAttributeKey.for_datetime("CreatedAt") + +# At workflow start +handle = await client.start_workflow( + OrderWorkflow.run, + order, + id=f"order-{order.id}", + task_queue="orders", + search_attributes=TypedSearchAttributes([ + SearchAttributePair(ORDER_ID, order.id), + SearchAttributePair(ORDER_STATUS, "pending"), + SearchAttributePair(ORDER_TOTAL, order.total), + SearchAttributePair(CREATED_AT, datetime.now(timezone.utc)), + ]), +) +``` + +Or upserted during workflow execution: + +```python +from temporalio import workflow +from temporalio.common import SearchAttributeKey, SearchAttributePair, TypedSearchAttributes + +ORDER_STATUS = SearchAttributeKey.for_keyword("OrderStatus") + +@workflow.defn +class OrderWorkflow: + @workflow.run + async def run(self, order: Order) -> str: + # ... process order ... + + # Update search attribute + workflow.upsert_search_attributes(TypedSearchAttributes([ + SearchAttributePair(ORDER_STATUS, "completed"), + ])) + return "done" +``` + +### Querying Workflows by Search Attributes + +```python +# List workflows using search attributes +async for workflow in client.list_workflows( + 'OrderStatus = "processing" OR OrderStatus = "pending"' +): + print(f"Workflow {workflow.id} is still processing") +``` + +## Workflow Memo + +Store arbitrary metadata with workflows (not searchable). 
+ +```python +# Set memo at workflow start +await client.execute_workflow( + OrderWorkflow.run, + order, + id=f"order-{order.id}", + task_queue="orders", + memo={ + "customer_name": order.customer_name, + "notes": "Priority customer", + }, +) +``` + +```python +# Read memo from workflow +@workflow.defn +class OrderWorkflow: + @workflow.run + async def run(self, order: Order) -> str: + notes: str = workflow.memo_value("notes", type_hint=str) + ... +``` + +## Deterministic APIs for Values + +Use these APIs within workflows for deterministic random values and UUIDs: + +```python +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self) -> str: + # Deterministic UUID (same on replay) + unique_id = workflow.uuid4() + + # Deterministic random (same on replay) + rng = workflow.random() + value = rng.randint(1, 100) + + return str(unique_id) +``` + +## Best Practices + +1. Use Pydantic for input/output validation +2. Keep payloads small—see `references/core/gotchas.md` for limits +3. Encrypt sensitive data with PayloadCodec +4. Use dataclasses for simple data structures +5. Use `workflow.uuid4()` and `workflow.random()` for deterministic values diff --git a/references/python/determinism-protection.md b/references/python/determinism-protection.md new file mode 100644 index 0000000..1376ced --- /dev/null +++ b/references/python/determinism-protection.md @@ -0,0 +1,233 @@ +# Python Workflow Sandbox + +## Overview + +The Python SDK runs workflows in a sandbox that provides automatic protection against non-deterministic operations. This is unique to the Python SDK. 
+
+## How the Sandbox Works
+
+The sandbox:
+- Isolates global state via `exec` compilation
+- Restricts non-deterministic library calls via proxy objects
+- Passes through standard library with restrictions
+- Reloads workflow files on each execution
+
+## Forbidden Operations
+
+These operations will fail in the sandbox:
+
+- **Direct I/O**: Network calls, file reads/writes
+- **Threading**: `threading` module operations
+- **Subprocess**: `subprocess` calls
+- **Global state**: Modifying mutable global variables
+- **Blocking sleep**: `time.sleep()` (use `workflow.sleep(timedelta(...))`)
+
+## Pass-Through Pattern
+
+Third-party libraries that aren't sandbox-aware need explicit pass-through:
+
+```python
+from temporalio import workflow
+
+with workflow.unsafe.imports_passed_through():
+    import pydantic
+    from my_module import my_dataclass
+```
+
+**When to use pass-through:**
+- Data classes and models (Pydantic, dataclasses)
+- Serialization libraries
+- Type definitions
+- Any library that doesn't do I/O or non-deterministic operations
+- Performance: non-passthrough imports are re-executed inside the sandbox and can be slow
+
+**Note:** Even when using `imports_passed_through`, all imports should be at the top of the file. Runtime imports are an anti-pattern.
+
+## Importing Activities
+
+Activities should be imported through pass-through since they're defined outside the sandbox:
+
+```python
+# workflows/order.py
+from datetime import timedelta
+
+from temporalio import workflow
+
+with workflow.unsafe.imports_passed_through():
+    from activities.payment import process_payment
+    from activities.shipping import ship_order
+
+@workflow.defn
+class OrderWorkflow:
+    @workflow.run
+    async def run(self, order_id: str) -> str:
+        await workflow.execute_activity(
+            process_payment,
+            order_id,
+            start_to_close_timeout=timedelta(minutes=5),
+        )
+        return await workflow.execute_activity(
+            ship_order,
+            order_id,
+            start_to_close_timeout=timedelta(minutes=10),
+        )
+```
+
+## Disabling the Sandbox
+
+```python
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self) -> str:
+        with workflow.unsafe.sandbox_unrestricted():
+            # Unrestricted code block
+            pass
+        return "result"
+```
+
+- Per-block escape hatch from runtime restrictions; imports unchanged.
+- Use when: you need to call something the sandbox would normally block (e.g., a restricted stdlib call) in a very small, controlled section.
+- **IMPORTANT:** Use it sparingly; you lose determinism checks inside the block.
+- Genuinely non-deterministic code still *MUST* go into activities.
+
+## Customizing Invalid Module Members
+
+`invalid_module_members` lists module members that cannot be accessed from workflow code.
+
+Checks are compared against the fully qualified path to the item.
+ +```python +import dataclasses +from temporalio.worker import Worker +from temporalio.worker.workflow_sandbox import ( + SandboxedWorkflowRunner, + SandboxMatcher, + SandboxRestrictions, +) + +# Example 1: Remove a restriction on datetime.date.today(): +restrictions = dataclasses.replace( + SandboxRestrictions.default, + invalid_module_members=SandboxRestrictions.invalid_module_members_default.with_child_unrestricted( + "datetime", "date", "today", + ), +) + +# Example 2: Restrict the datetime.date class from being used +restrictions = dataclasses.replace( + SandboxRestrictions.default, + invalid_module_members=SandboxRestrictions.invalid_module_members_default | SandboxMatcher( + children={"datetime": SandboxMatcher(use={"date"})}, + ), +) + +worker = Worker( + ..., + workflow_runner=SandboxedWorkflowRunner(restrictions=restrictions), +) +``` + +## Import Notification Policy + +Control warnings/errors for sandbox import issues. Recommended for catching potential problems: + +```python +from temporalio import workflow +from temporalio.worker.workflow_sandbox import SandboxedWorkflowRunner, SandboxRestrictions + +restrictions = SandboxRestrictions.default.with_import_notification_policy( + workflow.SandboxImportNotificationPolicy.WARN_ON_DYNAMIC_IMPORT + | workflow.SandboxImportNotificationPolicy.WARN_ON_UNINTENTIONAL_PASSTHROUGH +) + +worker = Worker( + ..., + workflow_runner=SandboxedWorkflowRunner(restrictions=restrictions), +) +``` + +- `WARN_ON_DYNAMIC_IMPORT` (default) - warns on imports after initial workflow load +- `WARN_ON_UNINTENTIONAL_PASSTHROUGH` - warns when modules are imported into sandbox without explicit passthrough (not default, but highly recommended for catching missing passthroughs) +- `RAISE_ON_UNINTENTIONAL_PASSTHROUGH` - raise instead of warn + +Override per-import with the context manager: + +```python +with workflow.unsafe.sandbox_import_notification_policy( + workflow.SandboxImportNotificationPolicy.SILENT +): + import pydantic # No 
warning for this import +``` + +## Disable Lazy sys.modules Passthrough + +By default, passthrough modules are lazily added to the sandbox's `sys.modules` when accessed. To require explicit imports: + +```python +import dataclasses +from temporalio.worker.workflow_sandbox import SandboxedWorkflowRunner, SandboxRestrictions + +restrictions = dataclasses.replace( + SandboxRestrictions.default, + disable_lazy_sys_module_passthrough=True, +) + +worker = Worker( + ..., + workflow_runner=SandboxedWorkflowRunner(restrictions=restrictions), +) +``` + +When `True`, passthrough modules must be explicitly imported to appear in the sandbox's `sys.modules`. + +## File Organization + +**Critical**: Keep workflow definitions in separate files from activity definitions. + +The sandbox reloads workflow definition files on every execution. Minimizing file contents improves Worker performance. + +``` +my_temporal_app/ +├── workflows/ +│ └── order.py # Only workflow classes +├── activities/ +│ └── payment.py # Only activity functions +├── models/ +│ └── order.py # Shared data models +├── worker.py # Worker setup, imports both +└── starter.py # Client code +``` + +## Common Issues + +### Import Errors + +``` +Error: Cannot import 'pydantic' in sandbox +``` + +**Fix**: Use pass-through: + +```python +with workflow.unsafe.imports_passed_through(): + import pydantic +``` + +### Non-Determinism from Libraries + +Some libraries do internal caching or use current time: + +```python +# May cause non-determinism +import some_library +result = some_library.cached_operation() # Cache changes between replays +``` + +**Fix**: Move to activity or use pass-through with caution. + +## Best Practices + +1. **Separate workflow and activity files** for performance +2. **Use pass-through explicitly** for third-party libraries +3. **Keep workflow files small** to minimize reload time +4. **Move I/O to activities** always +5. 
**Test with replay** to catch sandbox issues early
diff --git a/references/python/determinism.md b/references/python/determinism.md
new file mode 100644
index 0000000..7276360
--- /dev/null
+++ b/references/python/determinism.md
@@ -0,0 +1,51 @@
+# Python SDK Determinism
+
+## Overview
+
+The Python SDK runs workflows in a sandbox that provides automatic protection against many non-deterministic operations.
+
+## Why Determinism Matters: History Replay
+
+Temporal provides durable execution through **History Replay**. When a Worker needs to restore workflow state (after a crash, cache eviction, or to continue after a long timer), it re-executes the workflow code from the beginning, which requires the workflow code to be **deterministic**.
+
+## Forbidden Operations
+
+- Direct I/O (network, filesystem)
+- Threading operations
+- `subprocess` calls
+- Global mutable state modification
+- `time.sleep()` (use `workflow.sleep(timedelta(...))`)
+- and so on
+
+## Safe Builtin Alternatives to Common Non-Deterministic Operations
+
+| Forbidden | Safe Alternative |
+|-----------|------------------|
+| `datetime.now()` | `workflow.now()` |
+| `datetime.utcnow()` | `workflow.now()` |
+| `random.random()` | `rng = workflow.random(); rng.randint(1, 100)` |
+| `uuid.uuid4()` | `workflow.uuid4()` |
+| `time.time()` | `workflow.now().timestamp()` |
+
+## Testing Replay Compatibility
+
+Use the `Replayer` class to verify your code changes are compatible with existing histories. See the Workflow Replay Testing section of `references/python/testing.md`.
+
+## Sandbox Behavior
+
+The sandbox:
+- Isolates global state via `exec` compilation
+- Restricts non-deterministic library calls via proxy objects
+- Passes through standard library with restrictions
+
+See `references/python/determinism-protection.md` for more detail.
+
+## Best Practices
+
+1. Use `workflow.now()` for all time operations
+2. Use `workflow.random()` for random values
+3. 
Use `workflow.uuid4()` for unique identifiers +4. Pass through third-party libraries explicitly +5. Test with replay to catch non-determinism +6. Keep workflows focused on orchestration, delegate I/O to activities +7. Use `workflow.logger` instead of print() for replay-safe logging diff --git a/references/python/error-handling.md b/references/python/error-handling.md new file mode 100644 index 0000000..19460cb --- /dev/null +++ b/references/python/error-handling.md @@ -0,0 +1,138 @@ +# Python SDK Error Handling + +## Overview + +The Python SDK uses `ApplicationError` for application-specific errors and provides comprehensive retry policy configuration. Generally, the following information about errors and retryability applies across activities, child workflows and Nexus operations. + +## Application Errors + +```python +from temporalio import activity +from temporalio.exceptions import ApplicationError + +@activity.defn +async def validate_order(order: Order) -> None: + if not order.is_valid(): + raise ApplicationError( + "Invalid order", + type="ValidationError", + ) +``` + +## Non-Retryable Errors + +```python +from dataclasses import dataclass +from temporalio import activity +from temporalio.exceptions import ApplicationError + +@dataclass +class ChargeCardInput: + card_number: str + amount: float + +@activity.defn +async def charge_card(input: ChargeCardInput) -> str: + if not is_valid_card(input.card_number): + raise ApplicationError( + "Permanent failure - invalid credit card", + type="PaymentError", + non_retryable=True, # Will not retry activity + ) + return await process_payment(input.card_number, input.amount) +``` + +## Handling Activity Errors + +```python +from datetime import timedelta +from temporalio import workflow +from temporalio.exceptions import ActivityError, ApplicationError + +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self) -> str: + try: + return await workflow.execute_activity( + risky_activity, + 
start_to_close_timeout=timedelta(minutes=5),
+            )
+        except ActivityError as e:
+            workflow.logger.error(f"Activity failed: {e}")
+            # Handle or re-raise
+            raise ApplicationError("Workflow failed due to activity error")
+```
+
+## Retry Policy Configuration
+
+```python
+from datetime import timedelta
+from temporalio import workflow
+from temporalio.common import RetryPolicy
+
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self) -> str:
+        result = await workflow.execute_activity(
+            my_activity,
+            start_to_close_timeout=timedelta(minutes=10),
+            retry_policy=RetryPolicy(
+                maximum_interval=timedelta(minutes=1),
+                maximum_attempts=5,
+                non_retryable_error_types=["ValidationError", "PaymentError"],
+            ),
+        )
+        return result
+```
+
+Only set options such as `maximum_interval` or `maximum_attempts` if you have a domain-specific reason to.
+If not, prefer to leave them at their defaults.
+
+## Timeout Configuration
+
+```python
+from datetime import timedelta
+from temporalio import workflow
+
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self) -> str:
+        return await workflow.execute_activity(
+            my_activity,
+            start_to_close_timeout=timedelta(minutes=5),  # Single attempt
+            schedule_to_close_timeout=timedelta(minutes=30),  # Including retries
+            heartbeat_timeout=timedelta(minutes=2),  # Between heartbeats
+        )
+```
+
+## Workflow Failure
+
+```python
+from temporalio import workflow
+from temporalio.exceptions import ApplicationError
+
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self) -> str:
+        if some_condition:
+            raise ApplicationError(
+                "Cannot process order",
+                type="BusinessError",
+            )
+        return "success"
+```
+
+**Note:** Do not use `non_retryable=` with `ApplicationError` inside a workflow (as opposed to an activity).
+
+## Best Practices
+
+1. Use specific error types for different failure modes
+2. Mark permanent failures as non-retryable
+3. Configure appropriate retry policies
+4. 
Log errors before re-raising
+5. Use `ActivityError` to catch activity failures in workflows
+6. Design code to be idempotent for safe retries (see more at `references/core/patterns.md`)
diff --git a/references/python/gotchas.md b/references/python/gotchas.md
new file mode 100644
index 0000000..95ebe8a
--- /dev/null
+++ b/references/python/gotchas.md
@@ -0,0 +1,280 @@
+# Python Gotchas
+
+Python-specific mistakes and anti-patterns. See also [Common Gotchas](references/core/gotchas.md) for language-agnostic concepts.
+
+## File Organization
+
+### Importing Activities into Workflow Files
+
+**The Problem**: The Python sandbox reloads workflow files on every task. Importing heavy activity modules slows down workers.
+
+```python
+# BAD - activities.py gets reloaded constantly
+# workflows.py
+from activities import my_activity
+
+@workflow.defn
+class MyWorkflow:
+    pass
+
+# GOOD - Pass-through import
+# workflows.py
+from temporalio import workflow
+
+with workflow.unsafe.imports_passed_through():
+    from activities import my_activity
+
+@workflow.defn
+class MyWorkflow:
+    pass
+```
+
+`references/python/determinism-protection.md` contains more info about the Python sandbox.
+
+### Mixing Workflows and Activities
+
+```python
+# BAD - Everything in one file
+# app.py
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self):
+        await workflow.execute_activity(my_activity, ...)
+
+@activity.defn
+async def my_activity():
+    # Heavy imports, I/O, etc.
+    pass
+
+# GOOD - Separate files
+# workflows.py
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self):
+        await workflow.execute_activity(my_activity, ...)
+
+# activities.py
+@activity.defn
+async def my_activity():
+    pass
+```
+
+## Async vs Sync Activities
+
+The Temporal Python SDK supports both async and sync activities. See `references/python/sync-vs-async.md` to understand which to choose. Below are important anti-patterns for both async and sync activities.
+
+### Blocking in Async Activities
+
+```python
+# BAD - Blocks the event loop
+@activity.defn
+async def process_file(path: str) -> str:
+    with open(path) as f:  # Blocking I/O in async!
+        return f.read()
+
+# GOOD Option 1 - Use sync activity with executor
+@activity.defn
+def process_file(path: str) -> str:
+    with open(path) as f:
+        return f.read()
+
+# Register with executor in worker
+Worker(
+    client,
+    task_queue="my-queue",
+    activities=[process_file],
+    activity_executor=ThreadPoolExecutor(max_workers=10),
+)
+
+# GOOD Option 2 - Use async I/O
+@activity.defn
+async def process_file(path: str) -> str:
+    async with aiofiles.open(path) as f:
+        return await f.read()
+```
+
+### Missing Executor for Sync Activities
+
+```python
+# BAD - Sync activity REQUIRES executor
+@activity.defn
+def slow_computation(data: str) -> str:
+    return heavy_cpu_work(data)
+
+Worker(
+    client,
+    task_queue="my-queue",
+    activities=[slow_computation],
+    # Missing activity_executor! --> THIS IMMEDIATELY RAISES AN EXCEPTION!
+)
+
+# GOOD - Provide executor
+Worker(
+    client,
+    task_queue="my-queue",
+    activities=[slow_computation],
+    activity_executor=ThreadPoolExecutor(max_workers=10),
+)
+```
+
+## Wrong Retry Classification
+
+**Example:** Transient network errors should be retried. Authentication errors should not be.
+See `references/python/error-handling.md` to understand how to classify errors.
+
+## Heartbeating
+
+### Forgetting to Heartbeat Long Activities
+
+```python
+# BAD - No heartbeat, can't detect stuck activities
+@activity.defn
+async def process_large_file(path: str):
+    async for chunk in read_chunks(path):
+        process(chunk)  # Takes hours, no heartbeat
+
+# GOOD - Regular heartbeats with progress
+# (enumerate does not work on async iterators, so count manually)
+@activity.defn
+async def process_large_file(path: str):
+    i = 0
+    async for chunk in read_chunks(path):
+        activity.heartbeat(f"Processing chunk {i}")
+        process(chunk)
+        i += 1
+```
+
+### Heartbeat Timeout Too Short
+
+```python
+# BAD - Heartbeat timeout shorter than processing time
+await workflow.execute_activity(
+    process_chunk,
+    start_to_close_timeout=timedelta(minutes=30),
+    heartbeat_timeout=timedelta(seconds=10),  # Too short!
+)
+
+# GOOD - Heartbeat timeout allows for processing variance
+await workflow.execute_activity(
+    process_chunk,
+    start_to_close_timeout=timedelta(minutes=30),
+    heartbeat_timeout=timedelta(minutes=2),
+)
+```
+
+Set heartbeat timeout as high as acceptable for your use case — each heartbeat counts as an action.
+
+## Cancellation
+
+### Not Handling Workflow Cancellation
+
+```python
+# BAD - Cleanup doesn't run on cancellation
+@workflow.defn
+class BadWorkflow:
+    @workflow.run
+    async def run(self) -> None:
+        await workflow.execute_activity(
+            acquire_resource,
+            start_to_close_timeout=timedelta(minutes=5),
+        )
+        await workflow.execute_activity(
+            do_work,
+            start_to_close_timeout=timedelta(minutes=5),
+        )
+        await workflow.execute_activity(
+            release_resource,  # Never runs if cancelled!
+ start_to_close_timeout=timedelta(minutes=5), + ) + +# GOOD - Use try/finally for cleanup +@workflow.defn +class GoodWorkflow: + @workflow.run + async def run(self) -> None: + await workflow.execute_activity( + acquire_resource, + start_to_close_timeout=timedelta(minutes=5), + ) + try: + await workflow.execute_activity( + do_work, + start_to_close_timeout=timedelta(minutes=5), + ) + finally: + # Runs even on cancellation + await workflow.execute_activity( + release_resource, + start_to_close_timeout=timedelta(minutes=5), + ) +``` + +### Not Handling Activity Cancellation + +Activities must **opt in** to receive cancellation. This requires: +1. **Heartbeating** - Cancellation is delivered via heartbeat +2. **Catching the cancellation exception** - Exception is raised when heartbeat detects cancellation + +**Cancellation exceptions:** +- Async activities: `asyncio.CancelledError` +- Sync threaded activities: `temporalio.exceptions.CancelledError` + +```python +# BAD - Activity ignores cancellation +@activity.defn +async def long_activity() -> None: + await do_expensive_work() # Runs to completion even if cancelled +``` + +```python +# GOOD - Heartbeat and catch cancellation +@activity.defn +async def long_activity() -> None: + try: + for item in items: + activity.heartbeat() + await process(item) + except asyncio.CancelledError: + await cleanup() + raise +``` + +## Testing + +### Not Testing Failures + +It is important to make sure workflows work as expected under failure paths in addition to happy paths. Please see `references/python/testing.md` for more info. + +### Not Testing Replay + +Replay tests help you test that you do not have hidden sources of non-determinism bugs in your workflow code, and should be considered in addition to standard testing. Please see `references/python/testing.md` for more info. 
+
+## Timers and Sleep
+
+### Using time.sleep
+
+```python
+# BAD: time.sleep blocks the workflow thread and creates no durable timer
+import time
+
+@workflow.defn
+class BadWorkflow:
+    @workflow.run
+    async def run(self) -> None:
+        time.sleep(60)  # Blocks the worker!
+```
+
+```python
+# GOOD: Use workflow.sleep for durable timers
+from temporalio import workflow
+from datetime import timedelta
+
+@workflow.defn
+class GoodWorkflow:
+    @workflow.run
+    async def run(self) -> None:
+        await workflow.sleep(timedelta(seconds=60))  # Durable timer
+```
+
+**Why this matters:** `time.sleep` blocks the workflow's event loop without creating a timer, stalling every workflow on that worker. `workflow.sleep` creates a durable timer in the event history, so the delay survives worker restarts and replays consistently. (`asyncio.sleep` is also safe in workflows: the SDK's custom event loop backs it with a durable timer.)
diff --git a/references/python/observability.md b/references/python/observability.md
new file mode 100644
index 0000000..26296c3
--- /dev/null
+++ b/references/python/observability.md
@@ -0,0 +1,105 @@
+# Python SDK Observability
+
+## Overview
+
+The Python SDK provides comprehensive observability through logging, metrics, tracing, and visibility (Search Attributes).
+
+## Logging
+
+### Workflow Logging (Replay-Safe)
+
+Use `workflow.logger` for replay-safe logging that avoids duplicate messages:
+
+```python
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self, name: str) -> str:
+        workflow.logger.info("Workflow started", extra={"name": name})
+
+        result = await workflow.execute_activity(
+            my_activity,
+            start_to_close_timeout=timedelta(minutes=5),
+        )
+
+        workflow.logger.info("Activity completed", extra={"result": result})
+        return result
+```
+
+The workflow logger automatically:
+- Suppresses duplicate logs during replay
+- Includes workflow context (workflow ID, run ID, etc.)
+ +### Activity Logging + +Use `activity.logger` for context-aware activity logging: + +```python +@activity.defn +async def process_order(order_id: str) -> str: + activity.logger.info(f"Processing order {order_id}") + + # Perform work... + + activity.logger.info("Order processed successfully") + return "completed" +``` + +Activity logger includes: +- Activity ID, type, and task queue +- Workflow ID and run ID +- Attempt number (for retries) + +### Customizing Logger Configuration + +```python +import logging + +# Applies to temporalio.workflow.logger and temporalio.activity.logger, as Temporal inherits the default logger +logging.basicConfig( + level=logging.INFO, + format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", +) +``` + +## Metrics + +### Enabling SDK Metrics + +```python +from temporalio.client import Client +from temporalio.runtime import Runtime, TelemetryConfig, PrometheusConfig + +# Create a custom runtime +runtime = Runtime( + telemetry=TelemetryConfig( + metrics=PrometheusConfig(bind_address="0.0.0.0:9000") + ) +) + +# Set it as the global default BEFORE any Client/Worker is created +# Do this only ONCE. +Runtime.set_default(runtime, error_if_already_set=True) +# error_if_already_set can be False if you want to overwrite an existing default without raising. + +# ...elsewhere, client = ... as usual +``` + +### Key SDK Metrics + +- `temporal_request` - Client requests to server +- `temporal_workflow_task_execution_latency` - Workflow task processing time +- `temporal_activity_execution_latency` - Activity execution time +- `temporal_workflow_task_replay_latency` - Replay duration + + +## Search Attributes (Visibility) + +See the Search Attributes section of `references/python/data-handling.md` + +## Best Practices + +1. Use `workflow.logger` in workflows, `activity.logger` in activities +2. Don't use print() in workflows - it will produce duplicate output on replay +3. Configure metrics for production monitoring +4. 
Use Search Attributes for business-level visibility
diff --git a/references/python/patterns.md b/references/python/patterns.md
new file mode 100644
index 0000000..762977b
--- /dev/null
+++ b/references/python/patterns.md
@@ -0,0 +1,390 @@
+# Python SDK Patterns
+
+## Signals
+
+```python
+@workflow.defn
+class OrderWorkflow:
+    def __init__(self):
+        self._approved = False
+        self._items = []
+
+    @workflow.signal
+    async def approve(self) -> None:
+        self._approved = True
+
+    @workflow.signal
+    async def add_item(self, item: str) -> None:
+        self._items.append(item)
+
+    @workflow.run
+    async def run(self) -> str:
+        # Wait for approval
+        await workflow.wait_condition(lambda: self._approved)
+        return f"Processed {len(self._items)} items"
+```
+
+### Dynamic Signal Handlers
+
+For handling signals with names not known at compile time. Use cases for this pattern are rare — most workflows should use statically defined signal handlers.
+
+```python
+@workflow.defn
+class DynamicSignalWorkflow:
+    def __init__(self):
+        self._signals: dict[str, list[Any]] = {}
+
+    @workflow.signal(dynamic=True)
+    async def handle_signal(self, name: str, args: Sequence[RawValue]) -> None:
+        if name not in self._signals:
+            self._signals[name] = []
+        # RawValue wraps a Payload; convert via its .payload field
+        self._signals[name].append(workflow.payload_converter().from_payload(args[0].payload))
```
+
+## Queries
+
+**Important:** Queries must NOT modify workflow state or have side effects.
+
+```python
+@workflow.defn
+class StatusWorkflow:
+    def __init__(self):
+        self._status = "pending"
+        self._progress = 0
+
+    @workflow.query
+    def get_status(self) -> str:
+        return self._status
+
+    @workflow.query
+    def get_progress(self) -> int:
+        return self._progress
+
+    @workflow.run
+    async def run(self) -> str:
+        self._status = "running"
+        for i in range(100):
+            self._progress = i
+            await workflow.execute_activity(
+                process_item, i,
+                start_to_close_timeout=timedelta(minutes=1)
+            )
+        self._status = "completed"
+        return "done"
+```
+
+### Dynamic Query Handlers
+
+For handling queries with names not known at compile time. Use cases for this pattern are rare — most workflows should use statically defined query handlers.
+
+```python
+@workflow.query(dynamic=True)
+def handle_query(self, name: str, args: Sequence[RawValue]) -> Any:
+    if name == "get_field":
+        # RawValue wraps a Payload; convert via its .payload field
+        field_name = workflow.payload_converter().from_payload(args[0].payload)
+        return getattr(self, f"_{field_name}", None)
+```
+
+## Updates
+
+```python
+@workflow.defn
+class OrderWorkflow:
+    def __init__(self):
+        self._items: list[str] = []
+
+    @workflow.update
+    async def add_item(self, item: str) -> int:
+        self._items.append(item)
+        return len(self._items)  # Returns new count to caller
+
+    @add_item.validator
+    def validate_add_item(self, item: str) -> None:
+        if not item:
+            raise ValueError("Item cannot be empty")
+        if len(self._items) >= 100:
+            raise ValueError("Order is full")
+```
+
+## Child Workflows
+
+```python
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self, orders: list[Order]) -> list[str]:
+        results = []
+        for order in orders:
+            result = await workflow.execute_child_workflow(
+                ProcessOrderWorkflow.run,
+                order,
+                id=f"order-{order.id}",
+                # Control what happens to child when parent completes
+                parent_close_policy=workflow.ParentClosePolicy.ABANDON,
+            )
+            results.append(result)
+        return results
+```
+
+## Handles to External Workflows
+
+```python
+@workflow.defn
+class 
MyWorkflow:
+    @workflow.run
+    async def run(self, target_workflow_id: str) -> None:
+        # Get handle to external workflow
+        handle = workflow.get_external_workflow_handle(target_workflow_id)
+
+        # Signal the external workflow
+        await handle.signal(TargetWorkflow.data_ready, data_payload)
+
+        # Or cancel it
+        await handle.cancel()
+```
+
+## Parallel Execution
+
+```python
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self, items: list[str]) -> list[str]:
+        # Execute activities in parallel
+        tasks = [
+            workflow.execute_activity(
+                process_item, item,
+                start_to_close_timeout=timedelta(minutes=5)
+            )
+            for item in items
+        ]
+        return await asyncio.gather(*tasks)
+```
+
+### Deterministic Alternatives to asyncio
+
+Generally, asyncio is OK to use in Temporal workflows. But some asyncio calls are non-deterministic. Use Temporal's deterministic alternatives for safer concurrent operations:
+
+```python
+# workflow.wait() - like asyncio.wait()
+done, pending = await workflow.wait(
+    futures,
+    return_when=asyncio.FIRST_COMPLETED
+)
+
+# workflow.as_completed() - like asyncio.as_completed()
+async for future in workflow.as_completed(futures):
+    result = await future
+    # Process each result as it completes
+```
+
+## Continue-as-New
+
+```python
+@workflow.defn
+class MyWorkflow:
+    @workflow.run
+    async def run(self, state: WorkflowState) -> str:
+        while True:
+            state = await process_batch(state)
+
+            if state.is_complete:
+                return "done"
+
+            # Continue with fresh history before hitting limits
+            if workflow.info().is_continue_as_new_suggested():
+                workflow.continue_as_new(args=[state])
+```
+
+## Saga Pattern (Compensations)
+
+**Important:** Compensation activities should be idempotent - they may be retried (as with ALL activities).
+ +```python +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self, order: Order) -> str: + compensations: list[Callable[[], Awaitable[None]]] = [] + + try: + # Note - we save the compensation before running the activity, + # because the following could happen: + # 1. reserve_inventory starts running + # 2. it does successfully reserve inventory + # 3. but then fails for some other reason (timeout, reporting metrics, etc.) + # 4. in that case, the activity would have failed, but we still did the effect of reserving inventory + # So, we need to make sure we have a compensation already on the stack to handle that. + # This means the compensation needs to handle both the cases of reserved or unreserved inventory. + compensations.append(lambda: workflow.execute_activity( + release_inventory_if_reserved, order, + start_to_close_timeout=timedelta(minutes=5) + )) + await workflow.execute_activity( + reserve_inventory, order, + start_to_close_timeout=timedelta(minutes=5) + ) + + compensations.append(lambda: workflow.execute_activity( + refund_payment_if_charged, order, + start_to_close_timeout=timedelta(minutes=5) + )) + await workflow.execute_activity( + charge_payment, order, + start_to_close_timeout=timedelta(minutes=5) + ) + + await workflow.execute_activity( + ship_order, order, + start_to_close_timeout=timedelta(minutes=5) + ) + + return "Order completed" + + except Exception as e: + workflow.logger.error(f"Order failed: {e}, running compensations") + for compensate in reversed(compensations): + try: + await compensate() + except Exception as comp_err: + workflow.logger.error(f"Compensation failed: {comp_err}") + raise +``` + +## Cancellation Handling - leverages standard asyncio cancellation + +```python +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self) -> str: + try: + await workflow.execute_activity( + long_running_activity, + start_to_close_timeout=timedelta(hours=1), + ) + return "completed" + except 
asyncio.CancelledError: + # Workflow was cancelled - perform cleanup + workflow.logger.info("Workflow cancelled, running cleanup") + # Cleanup activities still run even after cancellation + await workflow.execute_activity( + cleanup_activity, + start_to_close_timeout=timedelta(minutes=5), + ) + raise # Re-raise to mark workflow as cancelled +``` + +## Wait Condition with Timeout + +```python +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self) -> str: + self._approved = False + + # Wait for approval with 24-hour timeout + try: + await workflow.wait_condition( + lambda: self._approved, + timeout=timedelta(hours=24) + ) + return "approved" + except asyncio.TimeoutError: + return "auto-rejected due to timeout" +``` + +## Waiting for All Handlers to Finish + +Signal and update handlers should generally be non-async (avoid running activities from them). Otherwise, the workflow may complete before handlers finish their execution. However, making handlers non-async sometimes requires workarounds that add complexity. + +When async handlers are necessary, use `wait_condition(all_handlers_finished)` at the end of your workflow (or before continue-as-new) to prevent completion until all pending handlers complete. + +```python +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self) -> str: + # ... main workflow logic ... 
+ + # Before exiting, wait for all handlers to finish + await workflow.wait_condition(workflow.all_handlers_finished) + return "done" +``` + +## Activity Heartbeat Details + +### WHY: +- **Support activity cancellation** - Cancellations are delivered via heartbeat; activities that don't heartbeat won't know they've been cancelled +- **Resume progress after worker failure** - Heartbeat details persist across retries + +**Cancellation exceptions:** +- Async activities: `asyncio.CancelledError` +- Sync threaded activities: `temporalio.exceptions.CancelledError` + +### WHEN: +- **Cancellable activities** - Any activity that should respond to cancellation +- **Long-running activities** - Track progress for resumability +- **Checkpointing** - Save progress periodically + +```python +from temporalio.exceptions import CancelledError + +@activity.defn +def process_large_file(file_path: str) -> str: + # Get heartbeat details from previous attempt (if any) + heartbeat_details = activity.info().heartbeat_details + start_line = heartbeat_details[0] if heartbeat_details else 0 + + try: + with open(file_path) as f: + for i, line in enumerate(f): + if i < start_line: + continue # Skip already processed lines + + process_line(line) + + # Heartbeat with progress + # If cancelled, heartbeat() raises CancelledError + activity.heartbeat(i + 1) + + return "completed" + except CancelledError: + # Perform cleanup on cancellation + cleanup() + raise +``` + +## Timers + +```python +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self) -> str: + await workflow.sleep(timedelta(hours=1)) + + return "Timer fired" +``` + +## Local Activities + +**Purpose**: Reduce latency for short, lightweight operations by skipping the task queue. ONLY use these when necessary for performance. Do NOT use these by default, as they are not durable and distributed. 
+ +```python +@workflow.defn +class MyWorkflow: + @workflow.run + async def run(self) -> str: + result = await workflow.execute_local_activity( + quick_lookup, + "key", + start_to_close_timeout=timedelta(seconds=5), + ) + return result +``` + +## Using Pydantic Models + +See `references/python/data-handling.md`. diff --git a/references/python/python.md b/references/python/python.md new file mode 100644 index 0000000..130b1eb --- /dev/null +++ b/references/python/python.md @@ -0,0 +1,175 @@ +# Temporal Python SDK Reference + +## Overview + +The Temporal Python SDK (`temporalio`) provides a fully async, type-safe approach to building durable workflows. Python 3.9+ required. Workflows run in a sandbox by default for determinism protection. + +## Quick Demo of Temporal + +**Add Dependency on Temporal:** In the package management system of the Python project you are working on, add a dependency on `temporalio`. + +**activities/greet.py** - Activity definitions (separate file for performance): +```python +from temporalio import activity + +@activity.defn +def greet(name: str) -> str: + return f"Hello, {name}!" 
+``` + +**workflows/greeting.py** - Workflow definition (import activities through sandbox): +```python +from datetime import timedelta +from temporalio import workflow + +with workflow.unsafe.imports_passed_through(): + from activities.greet import greet + +@workflow.defn +class GreetingWorkflow: + @workflow.run + async def run(self, name: str) -> str: + return await workflow.execute_activity( + greet, name, start_to_close_timeout=timedelta(seconds=30) + ) +``` + +**worker.py** - Worker setup (imports activity and workflow, runs indefinitely and processes tasks): +```python +import asyncio +import concurrent.futures +from temporalio.client import Client +from temporalio.worker import Worker + +# Import the activity and workflow from our other files +from activities.greet import greet +from workflows.greeting import GreetingWorkflow + +async def main(): + # Create client connected to server at the given address + # This is the default port for `temporal server start-dev` + client = await Client.connect("localhost:7233") + + # Run the worker + with concurrent.futures.ThreadPoolExecutor(max_workers=100) as activity_executor: + worker = Worker( + client, + task_queue="my-task-queue", + workflows=[GreetingWorkflow], + activities=[greet], + activity_executor=activity_executor, + ) + await worker.run() + +if __name__ == "__main__": + asyncio.run(main()) +``` + +**Start the dev server:** Start `temporal server start-dev` in the background. 
+
+**Start the worker:** Start `python worker.py` in the background (appropriately adjust command for your project, like `uv run python worker.py`)
+
+**starter.py** - Start a workflow execution:
+```python
+import asyncio
+from temporalio.client import Client
+import uuid
+
+# Import the workflow from the previous code
+from workflows.greeting import GreetingWorkflow
+
+async def main():
+    # Create client connected to server at the given address
+    client = await Client.connect("localhost:7233")
+
+    # Execute a workflow
+    result = await client.execute_workflow(GreetingWorkflow.run, "my name", id=str(uuid.uuid4()), task_queue="my-task-queue")
+
+    print(f"Result: {result}")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**Run the workflow:** Run `python starter.py` (or uv run, etc.). Should output: `Result: Hello, my name!`.
+
+
+## Key Concepts
+
+### Workflow Definition
+- Use `@workflow.defn` decorator on class
+- Use `@workflow.run` on the entry point method
+- Must be async (`async def`)
+- Use `@workflow.signal`, `@workflow.query`, `@workflow.update` for handlers
+
+### Activity Definition
+- Use `@activity.defn` decorator
+- Can be sync or async functions
+- **Default to sync activities** - safer and easier to debug
+- Sync activities need `activity_executor` (ThreadPoolExecutor)
+- Async activities require async-safe libraries throughout (e.g., `aiohttp` not `requests`)
+
+See `sync-vs-async.md` for detailed guidance on choosing between sync and async.
+
+### Worker Setup
+- Connect client, create Worker with workflows and activities
+- Run the worker
+- Activities can specify custom executor
+
+### Determinism
+
+**Workflow code must be deterministic.** All sources of non-determinism should either use Temporal-provided actions or (primarily) be defined in Activities. Read `references/core/determinism.md` and `references/python/determinism.md` to understand more.
+ +## File Organization Best Practice + +**Keep Workflow definitions in separate files from Activity definitions.** The Python SDK sandbox reloads Workflow definition files on every execution for determinism protection. Minimizing file contents improves Worker performance. + +``` +my_temporal_app/ +├── workflows/ +│ └── greeting.py # Only Workflow classes +├── activities/ +│ └── translate.py # Only Activity functions/classes +├── worker.py # Worker setup, imports both +└── starter.py # Client code to start workflows +``` + +**In the Workflow file, import Activities through the sandbox:** +```python +# workflows/greeting.py +from temporalio import workflow + +with workflow.unsafe.imports_passed_through(): + from activities.translate import TranslateActivities +``` + +## Common Pitfalls + +1. **Non-deterministic code in workflows** - Use activities for all non-deterministic and/or fallible code +2. **Blocking in async activities** - Use sync activities or async-safe libraries only +3. **Missing executor for sync activities** - Add `activity_executor=ThreadPoolExecutor()` +4. **Forgetting to heartbeat** - Long activities need `activity.heartbeat()` +5. **Using gevent** - Incompatible with SDK +6. **Using `print()` in workflows** - Use `workflow.logger` instead for replay-safe logging +7. **Mixing Workflows and Activities in same file** - Causes unnecessary reloads, hurts performance, bad structure +8. **Forgetting to wait on activity calls** - `workflow.execute_activity()` is async; you must eventually await it (directly or via `asyncio.gather()` for parallel execution) + +## Writing Tests + +See `references/python/testing.md` for info on writing tests. + +## Additional Resources + +### Reference Files +- **`references/python/patterns.md`** - Signals, queries, child workflows, saga pattern, etc. 
+- **`references/python/determinism.md`** - Sandbox behavior, safe alternatives, pass-through pattern, history replay +- **`references/python/gotchas.md`** - Python-specific mistakes and anti-patterns +- **`references/python/error-handling.md`** - ApplicationError, retry policies, non-retryable errors, idempotency +- **`references/python/observability.md`** - Logging, metrics, tracing, Search Attributes +- **`references/python/testing.md`** - WorkflowEnvironment, time-skipping, activity mocking +- **`references/python/sync-vs-async.md`** - Sync vs async activities, event loop blocking, executor configuration +- **`references/python/advanced-features.md`** - Schedules, worker tuning, and more +- **`references/python/data-handling.md`** - Data converters, Pydantic, payload encryption +- **`references/python/versioning.md`** - Patching API, workflow type versioning, Worker Versioning +- **`references/python/determinism-protection.md`** - Python sandbox specifics, forbidden operations, pass-through imports +- **`references/python/ai-patterns.md`** - LLM integration, Pydantic data converter, AI workflow patterns diff --git a/references/python/sync-vs-async.md b/references/python/sync-vs-async.md new file mode 100644 index 0000000..7875582 --- /dev/null +++ b/references/python/sync-vs-async.md @@ -0,0 +1,231 @@ +# Python SDK: Sync vs Async Activities + +## Overview + +The Temporal Python SDK supports multiple ways of implementing Activities: + +- **Asynchronous** using `asyncio` +- **Synchronous multithreaded** using `concurrent.futures.ThreadPoolExecutor` +- **Synchronous multiprocess** using `concurrent.futures.ProcessPoolExecutor` + +Choosing the correct approach is critical—incorrect usage can cause sporadic failures and difficult-to-diagnose bugs. + +## Recommendation: Default to Synchronous + +Activities should be synchronous by default. Use async only when certain the code doesn't block the event loop. 
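This default exists because a single blocking call stalls every task sharing the loop. The following standalone asyncio sketch (no Temporal required) demonstrates it: a "heartbeat" task that should tick every 0.1s cannot run until a blocking `time.sleep` on a sibling task releases the loop:

```python
import asyncio
import time

tick_times: list[float] = []

async def heartbeat() -> None:
    # Simulates other coroutines (e.g. worker polling) sharing the loop
    for _ in range(3):
        await asyncio.sleep(0.1)
        tick_times.append(time.monotonic())

async def blocking_task() -> None:
    time.sleep(0.5)  # Blocking call: nothing else on the loop can run

async def main() -> float:
    start = time.monotonic()
    await asyncio.gather(heartbeat(), blocking_task())
    return tick_times[0] - start

first_tick_delay = asyncio.run(main())
print(f"first tick after {first_tick_delay:.2f}s (would be ~0.1s if nothing blocked)")
```

The first tick arrives only after the 0.5s blocking sleep finishes, roughly five times later than intended. Inside a Temporal worker, the stalled tasks would include workflow progress and server communication.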
+ +## The Event Loop Problem + +The Python async event loop runs in a single thread. When any task runs, no other tasks can execute until an `await` is reached. If code makes a blocking call (file I/O, synchronous HTTP, etc.), the entire event loop freezes. + +**Consequences of blocking the event loop:** +- Worker cannot communicate with Temporal Server +- Workflow progress blocks across the worker +- Potential deadlocks and unpredictable behavior +- Difficult-to-diagnose bugs + +## How the SDK Handles Each Type + +### Synchronous Activities + +- Run in the `activity_executor`, which you must provide +- Protected from accidentally blocking the global event loop +- Multiple activities run in parallel via OS thread scheduling +- Thread pool provides preemptive switching between tasks + +```python +from concurrent.futures import ThreadPoolExecutor +from temporalio.worker import Worker + +with ThreadPoolExecutor(max_workers=100) as executor: + worker = Worker( + client, + task_queue="my-queue", + workflows=[MyWorkflow], + activities=[my_sync_activity], + activity_executor=executor, + ) + await worker.run() +``` + +### Asynchronous Activities + +- Share the default asyncio event loop with the Temporal worker +- Any blocking call freezes the entire loop +- Require async-safe libraries throughout + +```python +@activity.defn +async def my_async_activity(name: str) -> str: + # Must use async-safe libraries only + async with aiohttp.ClientSession() as session: + async with session.get(f"http://api.example.com/{name}") as response: + return await response.text() +``` + +## HTTP Libraries: A Critical Choice + +| Library | Type | Safe in Async Activity? 
| +|---------|------|------------------------| +| `requests` | Blocking | No - blocks event loop | +| `urllib3` | Blocking | No - blocks event loop | +| `aiohttp` | Async | Yes | +| `httpx` | Both | Yes (use async mode) | + +**Example: Wrong way (blocks event loop)** +```python +@activity.defn +async def bad_activity(url: str) -> str: + import requests + response = requests.get(url) # BLOCKS the event loop! + return response.text +``` + +**Example: Correct way (async-safe)** +```python +@activity.defn +async def good_activity(url: str) -> str: + async with aiohttp.ClientSession() as session: + async with session.get(url) as response: + return await response.text() +``` + +## Running Blocking Code in Async Activities + +If blocking code must run in an async activity, offload it to a thread: + +```python +import asyncio + +@activity.defn +async def activity_with_blocking_call() -> str: + # Run blocking code in a thread pool + loop = asyncio.get_event_loop() + result = await loop.run_in_executor(None, blocking_function) + return result + +# Or use asyncio.to_thread (Python 3.9+) +@activity.defn +async def activity_with_blocking_call_v2() -> str: + result = await asyncio.to_thread(blocking_function) + return result +``` + +## When to Use Async Activities + +Use async activities only when: + +1. All code paths are async-safe (no blocking calls) +2. Using async-native libraries (aiohttp, asyncpg, motor, etc.) +3. Performance benefits are needed for I/O-bound operations +4. The team understands async constraints + +## When to Use Sync Activities + +Use sync activities when: + +1. Making HTTP calls with `requests` or similar blocking libraries +2. Performing file I/O operations +3. Using database drivers that aren't async-native +4. Uncertain whether code is async-safe +5. Integrating with legacy or third-party synchronous code + +## Debugging Tip + +If experiencing sporadic bugs, hangs, or timeouts: + +1. Convert async activities to sync +2. Test thoroughly +3. 
If bugs disappear, the original async activity had blocking calls + +## Threading Considerations + +### Multi-Core Usage + +For CPU-bound work and multi-core usage: + +- Prefer multiple worker processes and/or threaded synchronous activities. +- Use ProcessPoolExecutor for synchronous activities only if you understand and accept the extra complexity and different cancellation semantics. + +### Separate Workers for Workflows vs Activities + +Some teams deploy: +- Workflow-only workers (CPU-bound, need deadlock detection) +- Activity-only workers (I/O-bound, may need more parallelism) + +This prevents resource contention and allows independent scaling. + +## Complete Example: Sync Activity with ThreadPoolExecutor + +```python +import urllib.parse +import requests +from concurrent.futures import ThreadPoolExecutor +from temporalio import activity +from temporalio.client import Client +from temporalio.worker import Worker + +@activity.defn +def greet_in_spanish(name: str) -> str: + """Synchronous activity using requests library.""" + url = f"http://localhost:9999/get-spanish-greeting?name={urllib.parse.quote(name)}" + response = requests.get(url) + return response.text + +async def main(): + client = await Client.connect("localhost:7233", namespace="default") + + with ThreadPoolExecutor(max_workers=100) as executor: + worker = Worker( + client, + task_queue="greeting-tasks", + workflows=[GreetingWorkflow], + activities=[greet_in_spanish], + activity_executor=executor, + ) + await worker.run() +``` + +## Complete Example: Async Activity with aiohttp + +```python +import aiohttp +import urllib.parse +from temporalio import activity +from temporalio.client import Client +from temporalio.worker import Worker + +class TranslateActivities: + def __init__(self, session: aiohttp.ClientSession): + self.session = session + + @activity.defn + async def greet_in_spanish(self, name: str) -> str: + """Async activity using aiohttp - safe for event loop.""" + url = 
f"http://localhost:9999/get-spanish-greeting?name={urllib.parse.quote(name)}" + async with self.session.get(url) as response: + return await response.text() + +async def main(): + client = await Client.connect("localhost:7233", namespace="default") + + async with aiohttp.ClientSession() as session: + activities = TranslateActivities(session) + worker = Worker( + client, + task_queue="greeting-tasks", + workflows=[GreetingWorkflow], + activities=[activities.greet_in_spanish], + ) + await worker.run() +``` + +## Summary + +| Aspect | Sync Activities | Async Activities | +|--------|-----------------|------------------| +| Default choice | Yes | Only when certain | +| Blocking calls | Safe (runs in thread pool) | Dangerous (blocks event loop) | +| HTTP library | `requests`, `httpx` | `aiohttp`, `httpx` (async) | +| Executor needed | Yes (`ThreadPoolExecutor`) | No | +| Debugging | Easier | Harder (timing issues) | diff --git a/references/python/testing.md b/references/python/testing.md new file mode 100644 index 0000000..63a0d14 --- /dev/null +++ b/references/python/testing.md @@ -0,0 +1,165 @@ +# Python SDK Testing + +## Overview + +You test Temporal Python Workflows using the Temporal testing package plus a normal Python test framework like pytest. The Temporal Python SDK provides `WorkflowEnvironment` for testing workflows in a local environment and `ActivityEnvironment` for isolated activity testing. + +## Workflow Test Environment + +The core pattern is: + +1. Start a test WorkflowEnvironment (`WorkflowEnvironment.start_local()`). +2. Start a Worker in that environment with your Workflow and Activities registered. +3. Use the environment’s client to execute the Workflow, using a fresh UUID for the task queue name and workflow ID. +4. Assert on the result or status. 
`WorkflowEnvironment.start_local` configures a ready-to-go local environment for running and testing workflows:

```python
import uuid

import pytest

from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker

from activities import my_activity
from workflows import MyWorkflow

@pytest.mark.asyncio
async def test_workflow():
    task_queue_name = str(uuid.uuid4())
    async with await WorkflowEnvironment.start_local() as env:
        async with Worker(
            env.client,
            task_queue=task_queue_name,
            workflows=[MyWorkflow],
            activities=[my_activity],
        ):
            result = await env.client.execute_workflow(
                MyWorkflow.run,
                "input",
                id=str(uuid.uuid4()),
                task_queue=task_queue_name,
            )
```

Conveniently, the local `env` can be shared among tests, e.g. via a pytest fixture.

If your workflows / tests involve long durations (such as using Temporal timers / sleeps), then you can use the time-skipping environment, via `WorkflowEnvironment.start_time_skipping()`.
Only use time-skipping if you must. It can *not* be shared among tests.

## Mocking Activities

```python
import uuid

import pytest

from temporalio import activity
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker

from workflows import MyWorkflow

@activity.defn(name="compose_greeting")
async def compose_greeting_mocked(input: str) -> str:
    return "mocked result"

@pytest.mark.asyncio
async def test_with_mock():
    task_queue_name = str(uuid.uuid4())
    async with await WorkflowEnvironment.start_local() as env:
        async with Worker(
            env.client,
            task_queue=task_queue_name,
            workflows=[MyWorkflow],
            activities=[compose_greeting_mocked],
        ):
            result = await env.client.execute_workflow(...)
```

## Testing Signals and Queries

```python
@pytest.mark.asyncio
async def test_signals():
    async with await WorkflowEnvironment.start_local() as env:
        async with Worker(...):
            handle = await env.client.start_workflow(...)  # same arguments as to execute_workflow

            # Send signal
            await handle.signal(MyWorkflow.my_signal, "data")

            # Query state
            status = await handle.query(MyWorkflow.get_status)
            assert status == "expected"

            # Wait for completion
            result = await handle.result()
```

## Testing Failure Cases

The example below shows how to test a failure case:

```python
import pytest

from temporalio import activity
from temporalio.client import WorkflowFailureError
from temporalio.exceptions import ApplicationError
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker

@pytest.mark.asyncio
async def test_activity_failure_handling():
    async with await WorkflowEnvironment.start_local() as env:
        # An example activity that always fails
        @activity.defn
        async def failing_activity() -> str:
            raise ApplicationError("Simulated failure", non_retryable=True)

        async with Worker(...):
            with pytest.raises(WorkflowFailureError):
                await env.client.execute_workflow(...)
```

## Workflow Replay Testing

```python
import json
import uuid

import pytest

from temporalio.client import WorkflowHistory
from temporalio.worker import Replayer

from workflows import MyWorkflow

@pytest.mark.asyncio
async def test_replay():
    with open("example-history.json", "r") as f:
        history_json = json.load(f)

    replayer = Replayer(workflows=[MyWorkflow])

    # Replay from a JSON history file (from_json takes the workflow ID first)
    await replayer.replay_workflow(
        WorkflowHistory.from_json(str(uuid.uuid4()), history_json)
    )
```

## Activity Testing

```python
import pytest

from temporalio.testing import ActivityEnvironment

from activities import my_activity

@pytest.mark.asyncio
async def test_activity():
    env = ActivityEnvironment()
    result = await env.run(my_activity, "arg1", "arg2")
    assert result == "expected"
```

## Best Practices

1. Use the `WorkflowEnvironment.start_local` environment for most testing
2. 
Use the time-skipping environment for workflows with durable timers / sleeps
3. Mock external dependencies in activities
4. Test replay compatibility, especially when changing workflow code
5. Test signal/query handlers explicitly
6. Use unique workflow IDs and task queues per test to avoid conflicts; the easiest approach is `str(uuid.uuid4())`

diff --git a/references/python/versioning.md b/references/python/versioning.md
new file mode 100644
index 0000000..abd4445
--- /dev/null
+++ b/references/python/versioning.md
@@ -0,0 +1,314 @@
# Python SDK Versioning

For conceptual overview and guidance on choosing an approach, see `references/core/versioning.md`.

## Patching API

### The patched() Function

The `patched()` function checks whether a Workflow should run new or old code:

```python
from datetime import timedelta

from temporalio import workflow

@workflow.defn
class ShippingWorkflow:
    @workflow.run
    async def run(self) -> None:
        if workflow.patched("send-email-instead-of-fax"):
            # New code path
            await workflow.execute_activity(
                send_email,
                start_to_close_timeout=timedelta(minutes=5),
            )
        else:
            # Old code path (for replay of existing workflows)
            await workflow.execute_activity(
                send_fax,
                start_to_close_timeout=timedelta(minutes=5),
            )
```

**How it works:**
- For new executions: `patched()` returns `True` and records a marker in the Workflow history
- For replay with the marker: `patched()` returns `True` (history includes this patch)
- For replay without the marker: `patched()` returns `False` (history predates this patch)

**Python-specific behavior:** The `patched()` return value is memoized on first call. This means you cannot reliably use `patched()` in loops—it will return the same value every iteration. Workaround: append a sequence number to the patch ID for each iteration (e.g., `f"my-change-{i}"`).

### Three-Step Patching Process

Patching is a three-step process for safely deploying changes.
**Warning:** Failing to follow this process correctly will result in non-determinism errors for in-flight workflows.

**Step 1: Patch in New Code**

Add the patch with both old and new code paths:

```python
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> str:
        if workflow.patched("add-fraud-check"):
            # New: Run fraud check before payment
            await workflow.execute_activity(
                check_fraud,
                order,
                start_to_close_timeout=timedelta(minutes=2),
            )

        # Original payment logic runs for both paths
        return await workflow.execute_activity(
            process_payment,
            order,
            start_to_close_timeout=timedelta(minutes=5),
        )
```

**Step 2: Deprecate the Patch**

Once all pre-patch Workflow Executions have completed, remove the old code and use `deprecate_patch()`:

```python
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> str:
        workflow.deprecate_patch("add-fraud-check")

        # Only new code remains
        await workflow.execute_activity(
            check_fraud,
            order,
            start_to_close_timeout=timedelta(minutes=2),
        )

        return await workflow.execute_activity(
            process_payment,
            order,
            start_to_close_timeout=timedelta(minutes=5),
        )
```

**Step 3: Remove the Patch**

After all workflows with the deprecated patch marker have completed, remove the `deprecate_patch()` call entirely:

```python
@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> str:
        await workflow.execute_activity(
            check_fraud,
            order,
            start_to_close_timeout=timedelta(minutes=2),
        )

        return await workflow.execute_activity(
            process_payment,
            order,
            start_to_close_timeout=timedelta(minutes=5),
        )
```

### Query Filters for Finding Workflows by Version

Use List Filters to find workflows with specific patch versions:

```bash
# Find running workflows with a specific patch
temporal workflow list --query \
  'WorkflowType = "OrderWorkflow" AND ExecutionStatus = "Running" AND TemporalChangeVersion = "add-fraud-check"'
# Find running workflows without any patch (pre-patch versions)
temporal workflow list --query \
  'WorkflowType = "OrderWorkflow" AND ExecutionStatus = "Running" AND TemporalChangeVersion IS NULL'
```

## Workflow Type Versioning

For incompatible changes, create a new Workflow Type instead of using patches:

```python
@workflow.defn(name="PizzaWorkflow")
class PizzaWorkflow:
    @workflow.run
    async def run(self, order: PizzaOrder) -> str:
        # Original implementation
        return await self._process_order_v1(order)

@workflow.defn(name="PizzaWorkflowV2")
class PizzaWorkflowV2:
    @workflow.run
    async def run(self, order: PizzaOrder) -> str:
        # New implementation with incompatible changes
        return await self._process_order_v2(order)
```

Register both with the Worker:

```python
worker = Worker(
    client,
    task_queue="pizza-task-queue",
    workflows=[PizzaWorkflow, PizzaWorkflowV2],
    activities=[make_pizza, deliver_pizza],
)
```

Update client code to start new workflows with the new type:

```python
# Old workflows continue on PizzaWorkflow
# New workflows use PizzaWorkflowV2
handle = await client.start_workflow(
    PizzaWorkflowV2.run,
    order,
    id=f"pizza-{order.id}",
    task_queue="pizza-task-queue",
)
```

Check for open executions before removing the old type:

```bash
temporal workflow list --query 'WorkflowType = "PizzaWorkflow" AND ExecutionStatus = "Running"'
```

## Worker Versioning

Worker Versioning manages versions at the deployment level, allowing multiple Worker versions to run simultaneously.

### Key Concepts

**Worker Deployment**: A logical service grouping similar Workers together (e.g., "loan-processor"). All versions of your code live under this umbrella.

**Worker Deployment Version**: A specific snapshot of your code identified by a deployment name and Build ID (e.g., "loan-processor:v1.0" or "loan-processor:abc123").
### Configuring Workers for Versioning

```python
from temporalio.common import WorkerDeploymentVersion
from temporalio.worker import Worker, WorkerDeploymentConfig

worker = Worker(
    client,
    task_queue="my-task-queue",
    workflows=[MyWorkflow],
    activities=[my_activity],
    deployment_config=WorkerDeploymentConfig(
        version=WorkerDeploymentVersion(
            deployment_name="my-service",
            build_id="v1.0.0",  # or git commit hash
        ),
        use_worker_versioning=True,
    ),
)
```

**Configuration parameters:**
- `use_worker_versioning`: Enables Worker Versioning
- `version`: Identifies the Worker Deployment Version (deployment name + build ID)
- Build ID: Typically a git commit hash, version number, or timestamp

### PINNED vs AUTO_UPGRADE Behaviors

**PINNED Behavior**

Workflows stay locked to their original Worker version. Declare it with the `versioning_behavior` parameter on `@workflow.defn`:

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import VersioningBehavior

@workflow.defn(versioning_behavior=VersioningBehavior.PINNED)
class StableWorkflow:
    @workflow.run
    async def run(self) -> str:
        # This workflow will always run on its assigned version
        return await workflow.execute_activity(
            process_order,
            start_to_close_timeout=timedelta(minutes=5),
        )
```

**When to use PINNED:**
- Short-running workflows (minutes to hours)
- Consistency is critical (e.g., financial transactions)
- You want to eliminate version compatibility complexity
- Building new applications and want simplest development experience

**AUTO_UPGRADE Behavior**

Workflows can move to newer versions. It is declared the same way, with `versioning_behavior=VersioningBehavior.AUTO_UPGRADE`.

**When to use AUTO_UPGRADE:**
- Long-running workflows (weeks or months)
- Workflows need to benefit from bug fixes during execution
- Migrating from traditional rolling deployments
- You are already using patching APIs for version transitions

**Important:** AUTO_UPGRADE workflows still need patching to handle version transitions safely since they can move between Worker versions.
### Worker Configuration with Default Behavior

```python
import os

from temporalio.common import VersioningBehavior

# For short-running workflows, prefer PINNED as the default behavior
worker = Worker(
    client,
    task_queue="orders-task-queue",
    workflows=[OrderWorkflow],
    activities=[process_order],
    deployment_config=WorkerDeploymentConfig(
        version=WorkerDeploymentVersion(
            deployment_name="order-service",
            build_id=os.environ["BUILD_ID"],
        ),
        use_worker_versioning=True,
        default_versioning_behavior=VersioningBehavior.PINNED,
    ),
)
```

### Deployment Strategies

**Blue-Green Deployments**

Maintain two environments and switch traffic between them:
1. Deploy new code to idle environment
2. Run tests and validation
3. Switch traffic to new environment
4. Keep old environment for instant rollback

**Rainbow Deployments**

Multiple versions run simultaneously:
- New workflows use latest version
- Existing workflows complete on their original version
- Add new versions alongside existing ones
- Gradually sunset old versions as workflows complete

This works well with Kubernetes where you manage multiple ReplicaSets running different Worker versions.

### Querying Workflows by Worker Version

```bash
# Find workflows on a specific Worker version
temporal workflow list --query \
  'TemporalWorkerDeploymentVersion = "my-service:v1.0.0" AND ExecutionStatus = "Running"'
```

## Best Practices

1. **Check for open executions** before removing old code paths
2. **Use descriptive patch IDs** that explain the change (e.g., "add-fraud-check" not "patch-1")
3. **Deploy patches incrementally**: patch, deprecate, remove
4. **Use PINNED for short workflows** to simplify version management
5. **Use AUTO_UPGRADE with patching** for long-running workflows that need updates
6. **Generate Build IDs from code** (git hash) to ensure changes produce new versions
7. 
**Avoid rolling deployments** for high-availability services with long-running workflows

diff --git a/references/typescript/advanced-features.md b/references/typescript/advanced-features.md
new file mode 100644
index 0000000..17b7e61
--- /dev/null
+++ b/references/typescript/advanced-features.md
@@ -0,0 +1,150 @@
# TypeScript SDK Advanced Features

## Schedules

Create recurring workflow executions.

```typescript
import { Client, ScheduleOverlapPolicy } from '@temporalio/client';

const client = new Client();

// Create a schedule
const schedule = await client.schedule.create({
  scheduleId: 'daily-report',
  spec: {
    intervals: [{ every: '1 day' }],
  },
  action: {
    type: 'startWorkflow',
    workflowType: 'dailyReportWorkflow',
    taskQueue: 'reports',
    args: [],
  },
  policies: {
    overlap: ScheduleOverlapPolicy.SKIP,
  },
});

// Manage schedules
const handle = client.schedule.getHandle('daily-report');
await handle.pause('Maintenance window');
await handle.unpause();
await handle.trigger(); // Run immediately
await handle.delete();
```

## Async Activity Completion

Complete an activity asynchronously from outside the activity function. Useful when the activity needs to wait for an external event.

**In the activity - return the task token:**
```typescript
import { CompleteAsyncError, activityInfo } from '@temporalio/activity';

export async function doSomethingAsync(): Promise<string> {
  const taskToken: Uint8Array = activityInfo().taskToken;
  setTimeout(() => doSomeWork(taskToken), 1000);
  throw new CompleteAsyncError();
}
```

**External completion (from another process, machine, etc.):**
```typescript
import { Client } from '@temporalio/client';

async function doSomeWork(taskToken: Uint8Array): Promise<void> {
  const client = new Client();
  // does some work...
  await client.activity.complete(taskToken, "Job's done!");
}
```

**When to use:**
- Waiting for human approval
- Waiting for external webhook callback
- Long-polling external systems

## Worker Tuning

Configure worker capacity for production workloads:

```typescript
import { Worker, NativeConnection } from '@temporalio/worker';

const worker = await Worker.create({
  connection: await NativeConnection.connect({ address: 'temporal:7233' }),
  taskQueue: 'my-queue',
  workflowBundle: { codePath: require.resolve('./workflow-bundle.js') }, // Pre-bundled for production
  activities,

  // Workflow task execution concurrency (default: 40)
  maxConcurrentWorkflowTaskExecutions: 100,

  // Activity execution concurrency (default: 100)
  maxConcurrentActivityTaskExecutions: 200,

  // Graceful shutdown timeout (default: 0)
  shutdownGraceTime: '30 seconds',

  // Max cached workflows (memory vs latency tradeoff)
  maxCachedWorkflows: 1000,
});
```

**Key settings:**
- `maxConcurrentWorkflowTaskExecutions`: Max workflow tasks executing simultaneously (default: 40)
- `maxConcurrentActivityTaskExecutions`: Max activities executing simultaneously (default: 100)
- `shutdownGraceTime`: Time to wait for in-progress work before forced shutdown
- `maxCachedWorkflows`: Number of workflows to keep in cache (reduces replay on cache hit)

## Sinks

Sinks allow workflows to emit events for side effects (logging, metrics).
```typescript
import { proxySinks, Sinks } from '@temporalio/workflow';

// Define sink interface
export interface LoggerSinks extends Sinks {
  logger: {
    info(message: string, attrs: Record<string, unknown>): void;
    error(message: string, attrs: Record<string, unknown>): void;
  };
}

// Use in workflow
const { logger } = proxySinks<LoggerSinks>();

export async function myWorkflow(input: string): Promise<string> {
  logger.info('Workflow started', { input });

  const result = await someActivity(input);

  logger.info('Workflow completed', { result });
  return result;
}

// Implement sink in worker
const worker = await Worker.create({
  workflowsPath: require.resolve('./workflows'), // Use workflowBundle for production
  activities,
  taskQueue: 'my-queue',
  sinks: {
    logger: {
      info: {
        fn(workflowInfo, message, attrs) {
          console.log(`[${workflowInfo.workflowId}] ${message}`, attrs);
        },
        callDuringReplay: false, // Don't log during replay
      },
      error: {
        fn(workflowInfo, message, attrs) {
          console.error(`[${workflowInfo.workflowId}] ${message}`, attrs);
        },
        callDuringReplay: false,
      },
    },
  },
});
```

diff --git a/references/typescript/data-handling.md b/references/typescript/data-handling.md
new file mode 100644
index 0000000..bfd4925
--- /dev/null
+++ b/references/typescript/data-handling.md
@@ -0,0 +1,253 @@
# TypeScript SDK Data Handling

## Overview

The TypeScript SDK uses data converters to serialize/deserialize workflow inputs, outputs, and activity parameters.

## Default Data Converter

The default converter handles:
- `undefined` and `null`
- `Uint8Array` (as binary)
- JSON-serializable types

Note: Protobuf support requires using a data converter (`DefaultPayloadConverterWithProtobufs`). See the Protobuf Support section below.

## Custom Data Converter

Create custom converters for special serialization needs.
```typescript
// payload-converter.ts
import {
  PayloadConverter,
  Payload,
  defaultPayloadConverter,
} from '@temporalio/common';

class CustomPayloadConverter implements PayloadConverter {
  toPayload<T>(value: T): Payload | undefined {
    // Custom serialization logic
    return defaultPayloadConverter.toPayload(value);
  }

  fromPayload<T>(payload: Payload): T {
    // Custom deserialization logic
    return defaultPayloadConverter.fromPayload(payload);
  }
}

export const payloadConverter = new CustomPayloadConverter();
```

```typescript
// client.ts
import { Client } from '@temporalio/client';

const client = new Client({
  dataConverter: {
    payloadConverterPath: require.resolve('./payload-converter'),
  },
});
```

```typescript
// worker.ts
import { Worker } from '@temporalio/worker';

const worker = await Worker.create({
  dataConverter: {
    payloadConverterPath: require.resolve('./payload-converter'),
  },
  // ...
});
```

## Composition of Payload Converters

```typescript
import { CompositePayloadConverter } from '@temporalio/common';

// The order matters — converters are tried in sequence until one's
// toPayload returns a defined (non-undefined) Payload
export const payloadConverter = new CompositePayloadConverter(
  new PayloadConverterFoo(),
  new PayloadConverterBar(),
);
```

## Protobuf Support

Using Protocol Buffers for type-safe serialization.

**Note:** JSON serialization (the default) is preferred for TypeScript applications—it's simpler and more performant. Use Protobuf only when interoperating with services that require it.

```typescript
import { DefaultPayloadConverterWithProtobufs } from '@temporalio/common/lib/protobufs';

const dataConverter: DataConverter = {
  payloadConverter: new DefaultPayloadConverterWithProtobufs({
    protobufRoot: myProtobufRoot,
  }),
};
```

## Payload Codec (Encryption)

Encrypt sensitive workflow data.
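Before wiring a codec, it can help to see the cipher on its own. The sketch below (an assumption for illustration — names and key handling are not Temporal API; it uses Node's built-in `crypto` with AES-256-GCM) shows the kind of encrypt/decrypt pair a codec's hooks could delegate to:

```typescript
import { randomBytes, createCipheriv, createDecipheriv } from 'crypto';

// AES-256-GCM helpers a PayloadCodec might delegate to (sketch only;
// real deployments need key management/rotation, which is omitted here).
export function encrypt(key: Buffer, data: Uint8Array): Uint8Array {
  const iv = randomBytes(12); // unique nonce per payload
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(data), cipher.final()]);
  // Prepend iv + auth tag so decryption is self-contained
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
}

export function decrypt(key: Buffer, data: Uint8Array): Uint8Array {
  const buf = Buffer.from(data);
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28); // GCM auth tag is 16 bytes
  const ciphertext = buf.subarray(28);
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}
```

Because GCM is authenticated, decryption with the wrong key (or tampered data) throws instead of returning garbage — a useful property when payloads round-trip through the Temporal cluster.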
```typescript
import { PayloadCodec, Payload } from '@temporalio/common';

// Payload metadata values are binary, so encode/compare the marker as bytes
const ENCODING = 'binary/encrypted';
const encodingBytes = new TextEncoder().encode(ENCODING);

class EncryptionCodec implements PayloadCodec {
  private readonly encryptionKey: Uint8Array;

  constructor(key: Uint8Array) {
    this.encryptionKey = key;
  }

  async encode(payloads: Payload[]): Promise<Payload[]> {
    return Promise.all(
      payloads.map(async (payload) => ({
        metadata: {
          encoding: encodingBytes,
        },
        // Sketch: a production codec typically encrypts the entire serialized
        // payload (metadata included), not just the data bytes
        data: await this.encrypt(payload.data ?? new Uint8Array()),
      }))
    );
  }

  async decode(payloads: Payload[]): Promise<Payload[]> {
    return Promise.all(
      payloads.map(async (payload) => {
        const encoding = payload.metadata?.encoding;
        if (encoding && new TextDecoder().decode(encoding) === ENCODING) {
          return {
            ...payload,
            data: await this.decrypt(payload.data ?? new Uint8Array()),
          };
        }
        return payload;
      })
    );
  }

  private async encrypt(data: Uint8Array): Promise<Uint8Array> {
    // Implement encryption (e.g., using Web Crypto API)
    return data;
  }

  private async decrypt(data: Uint8Array): Promise<Uint8Array> {
    // Implement decryption
    return data;
  }
}

// Apply codec
const dataConverter: DataConverter = {
  payloadCodecs: [new EncryptionCodec(encryptionKey)],
};
```

## Search Attributes

Custom searchable fields for workflow visibility.
### Setting Search Attributes at Start

```typescript
import { Client } from '@temporalio/client';

const client = new Client();

await client.workflow.start('orderWorkflow', {
  taskQueue: 'orders',
  workflowId: `order-${orderId}`,
  args: [order],
  searchAttributes: {
    OrderId: [orderId],
    CustomerType: ['premium'],
    OrderTotal: [99.99],
    CreatedAt: [new Date()],
  },
});
```

### Upserting Search Attributes from Workflow

```typescript
import { upsertSearchAttributes } from '@temporalio/workflow';

export async function orderWorkflow(order: Order): Promise<string> {
  // Update status as workflow progresses
  upsertSearchAttributes({
    OrderStatus: ['processing'],
  });

  await processOrder(order);

  upsertSearchAttributes({
    OrderStatus: ['completed'],
  });

  return 'done';
}
```

### Reading Search Attributes

```typescript
import { workflowInfo } from '@temporalio/workflow';

export async function orderWorkflow(): Promise<void> {
  const info = workflowInfo();
  const searchAttrs = info.searchAttributes;
  const orderId = searchAttrs?.OrderId?.[0];
  // ...
}
```

### Querying Workflows by Search Attributes

```typescript
const client = new Client();

// List workflows using search attributes
for await (const workflow of client.workflow.list({
  query: 'OrderStatus = "processing" AND CustomerType = "premium"',
})) {
  console.log(`Workflow ${workflow.workflowId} is still processing`);
}
```

## Workflow Memo

Store arbitrary metadata with workflows (not searchable).
```typescript
// Set memo at workflow start
await client.workflow.start('orderWorkflow', {
  taskQueue: 'orders',
  workflowId: `order-${orderId}`,
  args: [order],
  memo: {
    customerName: order.customerName,
    notes: 'Priority customer',
  },
});

// Read memo from workflow
import { workflowInfo } from '@temporalio/workflow';

export async function orderWorkflow(): Promise<void> {
  const info = workflowInfo();
  const customerName = info.memo?.customerName;
  // ...
}
```

## Best Practices

1. Keep payloads small—see `references/core/gotchas.md` for limits
2. Use search attributes for business-level visibility and filtering
3. Encrypt sensitive data with PayloadCodec
4. Use memo for non-searchable metadata
5. Configure the same data converter on both client and worker

diff --git a/references/typescript/determinism-protection.md b/references/typescript/determinism-protection.md
new file mode 100644
index 0000000..54303ba
--- /dev/null
+++ b/references/typescript/determinism-protection.md
@@ -0,0 +1,56 @@
# TypeScript Workflow V8 Sandboxing

## Overview

The TypeScript SDK runs workflows in a V8 sandbox that provides automatic protection against non-deterministic operations, and replaces common non-deterministic function calls with deterministic variants.

## Import Blocking

The sandbox blocks imports of Node.js built-in modules such as `fs` and `https`, along with other Node and DOM APIs. Otherwise, workflow code can import any package as long as it does not reference Node.js or DOM APIs.
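As an illustration (file names assumed, not SDK code): a pure computation module is safe to pull into workflow code, while anything touching Node APIs has to stay behind an activity boundary or a type-only import.

```typescript
// shared/pricing.ts - pure computation, safe to import from workflow code:
// it touches no Node.js or DOM APIs, so the workflow bundler accepts it.
export function orderTotal(prices: number[]): number {
  return prices.reduce((sum, price) => sum + price, 0);
}

// By contrast, a module with `import fs from 'fs'` at its top level would be
// rejected when bundled into the workflow sandbox. Call such code from an
// activity instead, or use `import type` if only its types are needed.
```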
**Note**: If you must use a library that references a Node.js or DOM API and you are certain that those APIs are not used at runtime, add that module to the `ignoreModules` list:

```ts
const worker = await Worker.create({
  workflowsPath: require.resolve('./workflows'), // bundlerOptions only apply with workflowsPath
  activities: require('./activities'),
  taskQueue: 'my-task-queue',
  bundlerOptions: {
    // These modules may be imported (directly or transitively),
    // but will be excluded from the Workflow bundle.
    ignoreModules: ['fs', 'http', 'crypto'],
  },
});
```

**Important**: Excluded modules are completely unavailable at runtime. Any attempt to call functions from these modules will throw an error. Only exclude modules when you are certain the code paths using them will never execute during workflow execution.

**Note**: Modules with the `node:` prefix (e.g., `node:fs`) require additional webpack configuration to ignore. You may need to configure the bundler's `externals` or use webpack `resolve.alias` to handle these imports.

Use this with *extreme caution*.

## Function Replacement

Functions like `Math.random()`, `Date`, and `setTimeout()` are replaced by deterministic versions.

Date-related functions return the timestamp at which the current workflow task was initially executed. That timestamp remains the same when the workflow task is replayed, and only advances when a durable operation occurs (like `sleep()`). For example:

```ts
import { sleep } from '@temporalio/workflow';

// this prints the *exact* same timestamp repeatedly
for (let x = 0; x < 10; ++x) {
  console.log(Date.now());
}

// this prints timestamps increasing roughly 1s each iteration
for (let x = 0; x < 10; ++x) {
  await sleep('1 second');
  console.log(Date.now());
}
```

Generally, this is the behavior you want.

Additionally, `FinalizationRegistry` and `WeakRef` are removed because V8's garbage collector is not deterministic.
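To see why a seeded replacement for `Math.random()` matters, here is a toy replay simulation in plain TypeScript (not SDK code — the PRNG and function names are illustrative): with the same seed, a re-execution takes exactly the same "random" branches as the first execution.

```typescript
// mulberry32: a tiny seeded PRNG, standing in for the sandbox's seeded Math.random
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// "First execution" and "replay" with the same seed take identical branches
function runWorkflowOnce(seed: number): boolean[] {
  const rand = mulberry32(seed);
  const decisions: boolean[] = [];
  for (let i = 0; i < 5; i++) decisions.push(rand() > 0.5);
  return decisions;
}

const first = runWorkflowOnce(42);
const replay = runWorkflowOnce(42);
console.log(JSON.stringify(first) === JSON.stringify(replay)); // true
```

An unseeded `Math.random()` would make the two runs diverge, which is exactly the non-determinism error the sandbox's function replacement prevents.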
diff --git a/references/typescript/determinism.md b/references/typescript/determinism.md
new file mode 100644
index 0000000..47f8948
--- /dev/null
+++ b/references/typescript/determinism.md
@@ -0,0 +1,51 @@
# TypeScript SDK Determinism

## Overview

The TypeScript SDK runs workflows in an isolated V8 sandbox that automatically provides determinism.

## Why Determinism Matters

Temporal provides durable execution through **History Replay**. When a Worker needs to restore workflow state (after a crash, cache eviction, or to continue after a long timer), it re-executes the workflow code from the beginning, which requires the workflow code to be **deterministic**.

## Temporal's V8 Sandbox

The Temporal TypeScript SDK executes all workflow code in a sandbox, which (among other things) replaces common non-deterministic functions with deterministic variants. As an example, consider the code below:

```ts
import { sleep } from '@temporalio/workflow';

// importData and sendReport are activity proxies (definitions not shown)
export async function myWorkflow(): Promise<string> {
  await importData();

  if (Math.random() > 0.5) {
    await sleep('30 minutes');
  }

  return await sendReport();
}
```

The Temporal workflow sandbox will use the same random seed when replaying a workflow, so the above code will **deterministically** generate pseudo-random numbers. For UUIDs, use `uuid4()` from `@temporalio/workflow`, which also uses the seeded PRNG.

See `references/typescript/determinism-protection.md` for more information about the sandbox.

## Forbidden Operations

```typescript
// DO NOT do these in workflows:
import fs from 'fs'; // Node.js modules
fetch('https://...'); // Network I/O
```

Most non-determinism and side effects, such as the above, should be wrapped in Activities.

## Testing Replay Compatibility

Use `Worker.runReplayHistory()` to verify your code changes are compatible with existing histories. See the Workflow Replay Testing section of `references/typescript/testing.md`.

## Best Practices

1. 
Use type-only imports for activities in workflow files
2. Match all @temporalio package versions
3. Prefer `sleep()` from workflow package — `setTimeout` works but `sleep()` handles cancellation scopes more clearly
4. Keep workflows focused on orchestration
5. Test with replay to verify determinism

diff --git a/references/typescript/error-handling.md b/references/typescript/error-handling.md
new file mode 100644
index 0000000..7072fbd
--- /dev/null
+++ b/references/typescript/error-handling.md
@@ -0,0 +1,119 @@
# TypeScript SDK Error Handling

## Overview

The TypeScript SDK uses `ApplicationFailure` for application errors with support for non-retryable marking.

## Application Failures

```typescript
import { ApplicationFailure } from '@temporalio/workflow';

export async function myWorkflow(): Promise<void> {
  throw ApplicationFailure.create({
    message: 'Invalid input',
    type: 'ValidationError',
    nonRetryable: true,
  });
}
```

## Activity Errors

```typescript
import { ApplicationFailure } from '@temporalio/activity';

export async function validateActivity(input: string): Promise<void> {
  if (!isValid(input)) {
    throw ApplicationFailure.create({
      message: `Invalid input: ${input}`,
      type: 'ValidationError',
      nonRetryable: true,
    });
  }
}
```

## Handling Errors in Workflows

```typescript
import { proxyActivities, ActivityFailure, ApplicationFailure, log } from '@temporalio/workflow';
import type * as activities from './activities';

const { riskyActivity } = proxyActivities<typeof activities>({
  startToCloseTimeout: '5 minutes',
});

export async function workflowWithErrorHandling(): Promise<string> {
  try {
    return await riskyActivity();
  } catch (err) {
    // Activity errors surface as ActivityFailure; the original error is its `cause`
    if (err instanceof ActivityFailure && err.cause instanceof ApplicationFailure) {
      log.warn('Activity failed', { type: err.cause.type, message: err.cause.message });
    }
    throw err;
  }
}
```

## Retry Configuration

```typescript
const { myActivity } = proxyActivities<typeof activities>({
  startToCloseTimeout: '10 minutes',
  retry: {
    initialInterval: '1s',
    backoffCoefficient: 2,
    maximumInterval: '1m',
    maximumAttempts: 5,
    nonRetryableErrorTypes: ['ValidationError', 'PaymentError'],
  },
});
```

**Note:** Only set retry options if you have a domain-specific reason to. The defaults are suitable for most use cases.

## Timeout Configuration

```typescript
const { myActivity } = proxyActivities<typeof activities>({
  startToCloseTimeout: '5 minutes', // Single attempt
  scheduleToCloseTimeout: '30 minutes', // Including retries
  heartbeatTimeout: '30 seconds', // Between heartbeats
});
```

## Workflow Failure

Workflows can throw errors to indicate failure:

```typescript
import { ApplicationFailure } from '@temporalio/workflow';

export async function myWorkflow(): Promise<string> {
  if (someCondition) {
    throw ApplicationFailure.create({
      message: 'Workflow failed due to invalid state',
      type: 'InvalidStateError',
    });
  }
  return 'success';
}
```

**Warning:** Do NOT use `nonRetryable: true` for workflow failures in most cases. Unlike activities, workflow retries are controlled by the caller, not retry policies. Use `nonRetryable` only for errors that are truly unrecoverable (e.g., invalid input that will never be valid).

## Idempotency

For idempotency patterns (using keys, making activities granular), see `core/patterns.md`.

## Best Practices

1. Use specific error types for different failure modes
2. Set `nonRetryable: true` for permanent failures in activities
3. Configure `nonRetryableErrorTypes` in retry policy
4. Log errors before re-raising
5. Catch `ActivityFailure` in workflows and inspect its `cause` (typically an `ApplicationFailure`)
6. 
Use the appropriate `log` import for your context: + - In workflows: `import { log } from '@temporalio/workflow'` (replay-safe) + - In activities: `import { log } from '@temporalio/activity'` diff --git a/references/typescript/gotchas.md b/references/typescript/gotchas.md new file mode 100644 index 0000000..d234f74 --- /dev/null +++ b/references/typescript/gotchas.md @@ -0,0 +1,312 @@ +# TypeScript Gotchas + +TypeScript-specific mistakes and anti-patterns. See also [Common Gotchas](../core/gotchas.md) for language-agnostic concepts. + +## Activity Imports + +### Importing Implementations Instead of Types + +**The Problem**: Importing activity implementations brings Node.js code into the V8 workflow sandbox, causing bundling errors or runtime failures. + +```typescript +// BAD - Brings actual code into workflow sandbox +import * as activities from './activities'; + +const { greet } = proxyActivities<typeof activities>({ + startToCloseTimeout: '1 minute', +}); + +// GOOD - Type-only import +import type * as activities from './activities'; + +const { greet } = proxyActivities<typeof activities>({ + startToCloseTimeout: '1 minute', +}); +``` + +### Importing Node.js Modules in Workflows + +```typescript +// BAD - fs is not available in workflow sandbox +import * as fs from 'fs'; + +export async function myWorkflow(): Promise<void> { + const data = fs.readFileSync('file.txt'); // Will fail! +} + +// GOOD - File I/O belongs in activities +export async function myWorkflow(): Promise<void> { + const data = await activities.readFile('file.txt'); +} +``` + +## Bundling Issues + +### Using workflowsPath in Production + +`workflowsPath` runs the bundler at Worker startup, which is slow and not suitable for production. Use `workflowBundle` with pre-bundled code instead. + +```typescript +// OK for development/testing, BAD for production - bundles at startup +const worker = await Worker.create({ + workflowsPath: require.resolve('./workflows'), + // ...
+}); + +// GOOD for production - use pre-bundled code +import { bundleWorkflowCode } from '@temporalio/worker'; + +// Build step (run once at build time) +const bundle = await bundleWorkflowCode({ + workflowsPath: require.resolve('./workflows'), +}); +await fs.promises.writeFile('./workflow-bundle.js', bundle.code); + +// Worker startup (fast, no bundling) +const worker = await Worker.create({ + workflowBundle: { + codePath: require.resolve('./workflow-bundle.js'), + }, + // ... +}); +``` + +### Missing Dependencies in Workflow Bundle + +```typescript +// If using external packages in workflows, ensure they're bundled + +// worker.ts +const worker = await Worker.create({ + workflowsPath: require.resolve('./workflows'), + bundlerOptions: { + // Exclude Node.js-only packages that cause bundling errors + // WARNING: Modules listed here will be completely unavailable + // at workflow runtime - any imports will fail + ignoreModules: ['some-node-only-package'], + }, +}); +``` + +### Package Version Mismatches + +All `@temporalio/*` packages must have the same version. This can be verified by running `npm ls` or the appropriate command for your package manager. + +### Package Version Constraints - Prod vs. Non-Prod + +For production apps, you should use ~ version constraints (bug fixes only) on Temporal packages. For non-production apps, you may use ^ constraints (the npm default) instead. + +## Wrong Retry Classification + +A common mistake is treating transient errors as permanent (or vice versa): + +- **Transient errors** (retry): network timeouts, temporary service unavailability, rate limits +- **Permanent errors** (don't retry): invalid input, authentication failure, resource not found + +```typescript +// BAD: Retrying a permanent error +throw ApplicationFailure.create({ message: 'User not found' }); +// This will retry indefinitely! 
+ + // GOOD: Mark permanent errors as non-retryable + throw ApplicationFailure.nonRetryable('User not found'); +``` + +For detailed guidance on error classification and retry policies, see `error-handling.md`. + +## Cancellation + +### Not Handling Workflow Cancellation + +```typescript +// BAD - Cleanup doesn't run on cancellation +export async function workflowWithCleanup(): Promise<void> { + await activities.acquireResource(); + await activities.doWork(); + await activities.releaseResource(); // Never runs if cancelled! +} + +// GOOD - Use CancellationScope for cleanup +import { CancellationScope } from '@temporalio/workflow'; + +export async function workflowWithCleanup(): Promise<void> { + await activities.acquireResource(); + try { + await activities.doWork(); + } finally { + // Run cleanup even on cancellation + await CancellationScope.nonCancellable(async () => { + await activities.releaseResource(); + }); + } +} +``` + +### Not Handling Activity Cancellation + +Activities must **opt in** to receive cancellation. This requires: +1. **Heartbeating** - Cancellation is delivered via heartbeat +2.
**Checking for cancellation** - Either await `Context.current().cancelled` or use `cancellationSignal()` + +```typescript +// BAD - Activity ignores cancellation +export async function longActivity(): Promise<void> { + await doExpensiveWork(); // Runs to completion even if cancelled +} +``` + +```typescript +// GOOD - Heartbeat in background and race work against cancellation promise +import { Context, CancelledFailure } from '@temporalio/activity'; + +export async function longActivity(): Promise<void> { + // Heartbeat in background so cancellation can be delivered + let heartbeatEnabled = true; + (async () => { + while (heartbeatEnabled) { + await Context.current().sleep(5000); + Context.current().heartbeat(); + } + })().catch(() => {}); + + try { + await Promise.race([ + Context.current().cancelled, // Rejects with CancelledFailure + doExpensiveWork(), + ]); + } catch (err) { + if (err instanceof CancelledFailure) { + await cleanup(); + } + throw err; + } finally { + heartbeatEnabled = false; + } +} +``` + +```typescript +// GOOD - Use AbortSignal with libraries that support it +import fetch from 'node-fetch'; +import { cancellationSignal, heartbeat } from '@temporalio/activity'; +import type { AbortSignal as FetchAbortSignal } from 'node-fetch/externals'; + +export async function cancellableFetch(url: string): Promise<Buffer> { + const response = await fetch(url, { signal: cancellationSignal() as FetchAbortSignal }); + + const contentLength = parseInt(response.headers.get('Content-Length')!); + let bytesRead = 0; + const chunks: Buffer[] = []; + + for await (const chunk of response.body) { + if (!(chunk instanceof Buffer)) throw new TypeError('Expected Buffer'); + bytesRead += chunk.length; + chunks.push(chunk); + heartbeat(bytesRead / contentLength); // Heartbeat to keep cancellation delivery alive + } + return Buffer.concat(chunks); +} +``` + +**Note:** `Promise.race` doesn't stop the losing promise—it continues running.
Use `cancellationSignal()` or explicitly abort sub-operations when cleanup requires stopping in-flight work. + +## Heartbeating + +### Forgetting to Heartbeat Long Activities + +```typescript +// BAD - No heartbeat, can't detect stuck activities +export async function processLargeFile(path: string): Promise<void> { + for await (const chunk of readChunks(path)) { + await processChunk(chunk); // Takes hours, no heartbeat + } +} + +// GOOD - Regular heartbeats with progress +import { heartbeat } from '@temporalio/activity'; + +export async function processLargeFile(path: string): Promise<void> { + let i = 0; + for await (const chunk of readChunks(path)) { + heartbeat(`Processing chunk ${i++}`); + await processChunk(chunk); + } +} +``` + +### Heartbeat Timeout Too Short + +```typescript +// BAD - Heartbeat timeout shorter than processing time +const { processChunk } = proxyActivities<typeof activities>({ + startToCloseTimeout: '30 minutes', + heartbeatTimeout: '10 seconds', // Too short! +}); + +// GOOD - Heartbeat timeout allows for processing variance +const { processChunk } = proxyActivities<typeof activities>({ + startToCloseTimeout: '30 minutes', + heartbeatTimeout: '2 minutes', +}); +``` + +Set heartbeat timeout as high as acceptable for your use case — each heartbeat counts as an action.
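As a rough illustration of the sizing advice above, here is a hypothetical helper (not part of the Temporal SDK) that derives a heartbeat timeout from the worst-case gap between `heartbeat()` calls:

```typescript
// Hypothetical helper (NOT part of the Temporal SDK): pick a heartbeat
// timeout that tolerates several missed intervals, so transient stalls
// don't cause spurious heartbeat-timeout failures.
function suggestHeartbeatTimeoutSeconds(
  worstCaseGapSeconds: number,
  headroomFactor = 4,
): number {
  return worstCaseGapSeconds * headroomFactor;
}

// e.g. chunks that take up to 30s between heartbeats -> 120s ('2 minutes')
console.log(suggestHeartbeatTimeoutSeconds(30));
```

The exact headroom factor is a judgment call; the point is that the timeout should comfortably exceed the slowest realistic gap between heartbeats.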
+ +## Testing + +### Not Testing Failures + +```typescript +import { TestWorkflowEnvironment } from '@temporalio/testing'; +import { Worker } from '@temporalio/worker'; +import { ApplicationFailure } from '@temporalio/common'; + +test('handles activity failure', async () => { + const env = await TestWorkflowEnvironment.createTimeSkipping(); + + const worker = await Worker.create({ + connection: env.nativeConnection, + taskQueue: 'test', + workflowsPath: require.resolve('./workflows'), + activities: { + // Activity that always fails + riskyOperation: async () => { + throw ApplicationFailure.nonRetryable('Simulated failure'); + }, + }, + }); + + await worker.runUntil(async () => { + await expect( + env.client.workflow.execute(riskyWorkflow, { + workflowId: 'test-failure', + taskQueue: 'test', + }) + ).rejects.toThrow('Simulated failure'); + }); + + await env.teardown(); +}); +``` + +### Not Testing Replay + +```typescript +import { Worker } from '@temporalio/worker'; +import * as fs from 'fs'; + +test('replay compatibility', async () => { + const history = JSON.parse(await fs.promises.readFile('./fixtures/workflow_history.json', 'utf8')); + + // Fails if current code is incompatible with history + await Worker.runReplayHistory( + { + workflowsPath: require.resolve('./workflows'), + }, + history, + ); +}); +``` + +## Timers and Sleep + +`setTimeout` works in workflows (the SDK mocks it), but `sleep()` from `@temporalio/workflow` is preferred because its interaction with cancellation scopes is more intuitive. See Timers in `references/typescript/patterns.md`. diff --git a/references/typescript/observability.md b/references/typescript/observability.md new file mode 100644 index 0000000..10244d7 --- /dev/null +++ b/references/typescript/observability.md @@ -0,0 +1,109 @@ +# TypeScript SDK Observability + +## Overview + +The TypeScript SDK provides replay-aware logging, metrics, and integrations for production observability.
+ +## Replay-Aware Logging + +Temporal's logger automatically suppresses duplicate messages during replay, preventing log spam when workflows recover state. + +### Workflow Logging + +Workflows run in a sandboxed environment and cannot use regular Node.js loggers directly. Since SDK 1.8.0, the `@temporalio/workflow` package exports a `log` object that provides replay-aware logging. Internally, it uses Sinks to funnel messages to the Runtime's logger. + +```typescript +import { log } from '@temporalio/workflow'; + +export async function orderWorkflow(orderId: string): Promise<string> { + log.info('Processing order', { orderId }); + + const result = await processPayment(orderId); + log.debug('Payment processed', { orderId, result }); + + return result; +} +``` + +**Log levels**: `log.debug()`, `log.info()`, `log.warn()`, `log.error()` + +The workflow logger automatically suppresses duplicate messages during replay and includes workflow context metadata (workflowId, runId, etc.) on every log entry. + +### Activity Logging + +```typescript +import { log } from '@temporalio/activity'; + +export async function processPayment(orderId: string): Promise<string> { + log.info('Processing payment', { orderId }); + return 'payment-id-123'; +} +``` + +The activity logger adds contextual metadata (activity ID, type, namespace) and funnels messages to the runtime's logger for consistent collection.
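The replay suppression described above can be sketched conceptually (this is an illustration of the idea only, not the SDK's actual implementation):

```typescript
// Conceptual sketch (NOT the SDK implementation): the real SDK tracks
// replay state internally and routes messages through sinks.
type LogFn = (message: string) => void;

function makeReplayAwareLogger(isReplaying: () => boolean, sink: LogFn): LogFn {
  return (message) => {
    // During replay the workflow re-executes old code paths; dropping
    // those messages prevents duplicate log lines after recovery.
    if (isReplaying()) return;
    sink(message);
  };
}

// Demo: the first message happens while "replaying" (suppressed),
// the second happens live (emitted).
const emitted: string[] = [];
let replaying = true;
const log = makeReplayAwareLogger(() => replaying, (m) => emitted.push(m));
log('processing order');  // suppressed
replaying = false;
log('payment processed'); // emitted
console.log(emitted);     // ['payment processed']
```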
+ +## Customizing the Logger + +### Basic Configuration + +```typescript +import { DefaultLogger, Runtime } from '@temporalio/worker'; + +const logger = new DefaultLogger('DEBUG', ({ level, message }) => { + console.log(`Custom logger: ${level} - ${message}`); +}); +Runtime.install({ logger }); +``` + +### Winston Integration + +```typescript +import winston from 'winston'; +import { DefaultLogger, Runtime } from '@temporalio/worker'; + +const winstonLogger = winston.createLogger({ + level: 'debug', + format: winston.format.json(), + transports: [ + new winston.transports.File({ filename: 'temporal.log' }) + ], +}); + +const logger = new DefaultLogger('DEBUG', (entry) => { + winstonLogger.log({ + label: entry.meta?.activityId ? 'activity' : entry.meta?.workflowId ? 'workflow' : 'worker', + level: entry.level.toLowerCase(), + message: entry.message, + timestamp: Number(entry.timestampNanos / 1_000_000n), + ...entry.meta, + }); +}); + +Runtime.install({ logger }); +``` + +## Metrics + +### Prometheus Metrics + +```typescript +import { Runtime } from '@temporalio/worker'; + +Runtime.install({ + telemetryOptions: { + metrics: { + prometheus: { + bindAddress: '127.0.0.1:9091', + }, + }, + }, +}); +``` + +## Best Practices + +1. Use `log` from `@temporalio/workflow` for production observability. For temporary print debugging, `console.log()` is fine—it's direct and immediate, whereas `log` goes through sinks which may lose messages on workflow errors +2. Include correlation IDs (orderId, customerId) in log messages +3. Configure Winston or similar for production log aggregation +4. Monitor Prometheus metrics for worker health +5. 
Use Event History for debugging workflow issues diff --git a/references/typescript/patterns.md b/references/typescript/patterns.md new file mode 100644 index 0000000..878f9f0 --- /dev/null +++ b/references/typescript/patterns.md @@ -0,0 +1,412 @@ +# TypeScript SDK Patterns + +## Signals + +```typescript +import { defineSignal, setHandler, condition } from '@temporalio/workflow'; + +const approveSignal = defineSignal<[boolean]>('approve'); +const addItemSignal = defineSignal<[string]>('addItem'); + +export async function orderWorkflow(): Promise<string> { + let approved = false; + const items: string[] = []; + + setHandler(approveSignal, (value) => { + approved = value; + }); + + setHandler(addItemSignal, (item) => { + items.push(item); + }); + + await condition(() => approved); + return `Processed ${items.length} items`; +} +``` + +## Dynamic Signal Handlers + +For handling signals with names not known at compile time. Use cases for this pattern are rare — most workflows should use statically defined signal handlers. + +```typescript +import { setDefaultSignalHandler, condition } from '@temporalio/workflow'; + +export async function dynamicSignalWorkflow(): Promise<Record<string, unknown[][]>> { + const signals: Record<string, unknown[][]> = {}; + + setDefaultSignalHandler((signalName: string, ...args: unknown[]) => { + if (!signals[signalName]) { + signals[signalName] = []; + } + signals[signalName].push(args); + }); + + await condition(() => signals['done'] !== undefined); + return signals; +} +``` + +## Queries + +**Important:** Queries must NOT modify workflow state or have side effects.
+ +```typescript +import { defineQuery, setHandler } from '@temporalio/workflow'; + +const statusQuery = defineQuery<string>('status'); +const progressQuery = defineQuery<number>('progress'); + +export async function progressWorkflow(): Promise<void> { + let status = 'running'; + let progress = 0; + + setHandler(statusQuery, () => status); + setHandler(progressQuery, () => progress); + + for (let i = 0; i < 100; i++) { + progress = i; + await doWork(); + } + status = 'completed'; +} +``` + +## Dynamic Query Handlers + +For handling queries with names not known at compile time. Use cases for this pattern are rare — most workflows should use statically defined query handlers. + +```typescript +import { setDefaultQueryHandler } from '@temporalio/workflow'; + +export async function dynamicQueryWorkflow(): Promise<void> { + const state: Record<string, unknown> = { + status: 'running', + progress: 0, + }; + + setDefaultQueryHandler((queryName: string) => { + return state[queryName]; + }); + + // ... workflow logic +} +``` + +## Updates + +```typescript +import { defineUpdate, setHandler, condition } from '@temporalio/workflow'; + +// Define the update - specify return type and argument types +export const addItemUpdate = defineUpdate<number, [string]>('addItem'); +export const addItemValidatedUpdate = defineUpdate<number, [string]>('addItemValidated'); + +export async function orderWorkflow(): Promise<string> { + const items: string[] = []; + let completed = false; + + // Simple update handler - returns new item count + setHandler(addItemUpdate, (item: string) => { + items.push(item); + return items.length; + }); + + // Update handler with validator - rejects invalid input before execution + setHandler( + addItemValidatedUpdate, + (item: string) => { + items.push(item); + return items.length; + }, + { + validator: (item: string) => { + if (!item) throw new Error('Item cannot be empty'); + if (items.length >= 100) throw new Error('Order is full'); + }, + } + ); + + await condition(() => completed); + return `Order with ${items.length} items completed`; +}
+``` + +## Child Workflows + +```typescript +import { executeChild } from '@temporalio/workflow'; + +export async function parentWorkflow(orders: Order[]): Promise<string[]> { + const results: string[] = []; + + for (const order of orders) { + const result = await executeChild(processOrderWorkflow, { + args: [order], + workflowId: `order-${order.id}`, + }); + results.push(result); + } + + return results; +} +``` + +### Child Workflow Options + +```typescript +import { executeChild, workflowInfo, ParentClosePolicy, ChildWorkflowCancellationType } from '@temporalio/workflow'; + +const result = await executeChild(childWorkflow, { + args: [input], + workflowId: `child-${workflowInfo().workflowId}`, + + // ParentClosePolicy - what happens to child when parent closes + // TERMINATE (default), ABANDON, REQUEST_CANCEL + parentClosePolicy: ParentClosePolicy.TERMINATE, + + // ChildWorkflowCancellationType - how cancellation is handled + // WAIT_CANCELLATION_COMPLETED (default), WAIT_CANCELLATION_REQUESTED, TRY_CANCEL, ABANDON + cancellationType: ChildWorkflowCancellationType.WAIT_CANCELLATION_COMPLETED, +}); +``` + +## Handles to External Workflows + +```typescript +import { getExternalWorkflowHandle } from '@temporalio/workflow'; +import { mySignal } from './other-workflows'; + +export async function coordinatorWorkflow(targetWorkflowId: string): Promise<void> { + const handle = getExternalWorkflowHandle(targetWorkflowId); + + // Signal the external workflow + await handle.signal(mySignal, { data: 'payload' }); + + // Or cancel it + await handle.cancel(); +} +``` + +## Parallel Execution + +```typescript +export async function parallelWorkflow(items: string[]): Promise<string[]> { + return await Promise.all( + items.map((item) => processItem(item)) + ); +} +``` + +## Continue-as-New + +```typescript +import { continueAsNew, workflowInfo } from '@temporalio/workflow'; + +export async function longRunningWorkflow(state: State): Promise<string> { + while (true) { + state = await processNextBatch(state); + + if
(state.isComplete) { + return 'done'; + } + + const info = workflowInfo(); + if (info.continueAsNewSuggested || info.historyLength > 10000) { + await continueAsNew(state); + } + } +} +``` + +## Saga Pattern + +**Important:** Compensation activities should be idempotent. + +```typescript +import { log } from '@temporalio/workflow'; + +export async function sagaWorkflow(order: Order): Promise<string> { + const compensations: Array<() => Promise<void>> = []; + + try { + // IMPORTANT: Register the compensation BEFORE calling the activity. + // If the activity fails after completing but before returning, + // compensation must still be registered. + compensations.push(() => releaseInventory(order)); + await reserveInventory(order); + + compensations.push(() => refundPayment(order)); + await chargePayment(order); + + await shipOrder(order); + return 'Order completed'; + } catch (err) { + for (const compensate of compensations.reverse()) { + try { + await compensate(); + } catch (compErr) { + log.warn('Compensation failed', { error: compErr }); + } + } + throw err; + } +} +``` + +## Cancellation Scopes + +Cancellation scopes control how cancellation propagates to activities and child workflows. Use them for cleanup logic, timeouts, and manual cancellation. + +```typescript +import { CancellationScope, sleep } from '@temporalio/workflow'; + +export async function scopedWorkflow(): Promise<void> { + // Non-cancellable scope - runs even if workflow cancelled + await CancellationScope.nonCancellable(async () => { + await cleanupActivity(); + }); + + // Timeout scope + await CancellationScope.withTimeout('5 minutes', async () => { + await longRunningActivity(); + }); + + // Manual cancellation + const scope = new CancellationScope(); + const promise = scope.run(() => someActivity()); + scope.cancel(); +} +``` + +## Triggers (Promise-like Signals) + +**WHY**: Triggers provide a one-shot promise that resolves when a signal is received. Cleaner than condition() for single-value signals.
**WHEN to use**: +- Waiting for a single response (approval, completion notification) +- Converting signal-based events into awaitable promises + +```typescript +import { Trigger, setHandler } from '@temporalio/workflow'; + +export async function triggerWorkflow(): Promise<string> { + const approvalTrigger = new Trigger<boolean>(); + + setHandler(approveSignal, (approved) => { + approvalTrigger.resolve(approved); + }); + + const approved = await approvalTrigger; + return approved ? 'Approved' : 'Rejected'; +} +``` + +## Wait Condition with Timeout + +```typescript +import { condition, setHandler } from '@temporalio/workflow'; + +export async function approvalWorkflow(): Promise<string> { + let approved = false; + + setHandler(approveSignal, () => { + approved = true; + }); + + // Wait for approval with 24-hour timeout + const gotApproval = await condition(() => approved, '24 hours'); + + if (gotApproval) { + return 'approved'; + } else { + return 'auto-rejected due to timeout'; + } +} +``` + +## Waiting for All Handlers to Finish + +Signal and update handlers should generally be non-async (avoid running activities from them). Otherwise, the workflow may complete before handlers finish their execution. However, making handlers non-async sometimes requires workarounds that add complexity. + +When async handlers are necessary, use `condition(allHandlersFinished)` at the end of your workflow (or before continue-as-new) to prevent completion until all pending handlers complete. + +```typescript +import { condition, allHandlersFinished } from '@temporalio/workflow'; + +export async function handlerAwareWorkflow(): Promise<string> { + // ... main workflow logic ...
+ + // Before exiting, wait for all handlers to finish + await condition(allHandlersFinished); + return 'done'; +} +``` + +## Activity Heartbeat Details + +### WHY: +- **Support activity cancellation** - Cancellations are delivered via heartbeat; activities that don't heartbeat won't know they've been cancelled +- **Resume progress after worker failure** - Heartbeat details persist across retries + +### WHEN: +- **Cancellable activities** - Any activity that should respond to cancellation +- **Long-running activities** - Track progress for resumability +- **Checkpointing** - Save progress periodically + +```typescript +import { heartbeat, activityInfo, CancelledFailure } from '@temporalio/activity'; + +export async function processLargeFile(filePath: string): Promise<string> { + const info = activityInfo(); + // Get heartbeat details from previous attempt (if any) + const startLine: number = info.heartbeatDetails ?? 0; + + const lines = await readFileLines(filePath); + + try { + for (let i = startLine; i < lines.length; i++) { + await processLine(lines[i]); + // Heartbeat with progress; details persist across retries so a + // retried attempt can resume from the last reported line + heartbeat(i + 1); + } + return 'completed'; + } catch (e) { + if (e instanceof CancelledFailure) { + // Perform cleanup on cancellation + await cleanup(); + } + throw e; + } +} +``` + +## Timers + +```typescript +import { sleep } from '@temporalio/workflow'; + +export async function timerWorkflow(): Promise<string> { + await sleep('1 hour'); + return 'Timer fired'; +} +``` + +## Local Activities + +**Purpose**: Reduce latency for short, lightweight operations by skipping the task queue. ONLY use these when necessary for performance. Do NOT use these by default, as they do not provide the durability and distribution guarantees of regular activities.
+ +```typescript +import { proxyLocalActivities } from '@temporalio/workflow'; +import type * as activities from './activities'; + +const { quickLookup } = proxyLocalActivities<typeof activities>({ + startToCloseTimeout: '5 seconds', +}); + +export async function localActivityWorkflow(): Promise<string> { + const result = await quickLookup('key'); + return result; +} +``` diff --git a/references/typescript/testing.md b/references/typescript/testing.md new file mode 100644 index 0000000..e945ed8 --- /dev/null +++ b/references/typescript/testing.md @@ -0,0 +1,222 @@ +# TypeScript SDK Testing + +## Overview + +The TypeScript SDK provides `TestWorkflowEnvironment` for testing workflows with time-skipping and activity mocking support. Use `createTimeSkipping()` for automatic time advancement when testing workflows with timers, or `createLocal()` for a full local server without time-skipping. + +**Note:** Prefer to use `createLocal()` for full-featured support. Only use `createTimeSkipping()` if you genuinely need time skipping for testing your workflow.
+ +## Test Environment Setup + +```typescript +import { TestWorkflowEnvironment } from '@temporalio/testing'; +import { Worker } from '@temporalio/worker'; + +describe('Workflow', () => { + let testEnv: TestWorkflowEnvironment; + + before(async () => { + testEnv = await TestWorkflowEnvironment.createLocal(); + }); + + after(async () => { + await testEnv?.teardown(); + }); + + it('runs workflow', async () => { + const { client, nativeConnection } = testEnv; + + const worker = await Worker.create({ + connection: nativeConnection, + taskQueue: 'test', + workflowsPath: require.resolve('./workflows'), + activities: require('./activities'), + }); + + await worker.runUntil(async () => { + const result = await client.workflow.execute(greetingWorkflow, { + taskQueue: 'test', + workflowId: 'test-workflow', + args: ['World'], + }); + expect(result).toEqual('Hello, World!'); + }); + }); +}); +``` + +## Activity Mocking + +```typescript +const worker = await Worker.create({ + connection: nativeConnection, + taskQueue: 'test', + workflowsPath: require.resolve('./workflows'), + activities: { + // Mock activity implementation + greet: async (name: string) => `Mocked: ${name}`, + }, +}); +``` + +## Testing Signals and Queries + +```typescript +import { defineQuery, defineSignal } from '@temporalio/workflow'; + +// Define query and signal (typically in a shared file) +const getStatusQuery = defineQuery<string>('getStatus'); +const approveSignal = defineSignal('approve'); + +it('handles signals and queries', async () => { + await worker.runUntil(async () => { + const handle = await client.workflow.start(approvalWorkflow, { + taskQueue: 'test', + workflowId: 'approval-test', + }); + + // Query current state + const status = await handle.query(getStatusQuery); + expect(status).toEqual('pending'); + + // Send signal + await handle.signal(approveSignal); + + // Wait for completion + const result = await handle.result(); + expect(result).toEqual('Approved!'); + }); +}); +``` + +## Testing Failure
Cases + +Test that workflows handle errors correctly: + +```typescript +import { TestWorkflowEnvironment } from '@temporalio/testing'; +import { Worker } from '@temporalio/worker'; +import { WorkflowFailedError } from '@temporalio/client'; +import assert from 'assert'; + +describe('Failure handling', () => { + let testEnv: TestWorkflowEnvironment; + + before(async () => { + testEnv = await TestWorkflowEnvironment.createLocal(); + }); + + after(async () => { + await testEnv?.teardown(); + }); + + it('handles activity failure', async () => { + const { client, nativeConnection } = testEnv; + + const worker = await Worker.create({ + connection: nativeConnection, + taskQueue: 'test', + workflowsPath: require.resolve('./workflows'), + activities: { + // Mock activity that always fails + myActivity: async () => { + throw new Error('Activity failed'); + }, + }, + }); + + await worker.runUntil(async () => { + try { + await client.workflow.execute(myWorkflow, { + workflowId: 'test-failure', + taskQueue: 'test', + }); + assert.fail('Expected workflow to fail'); + } catch (err) { + assert(err instanceof WorkflowFailedError); + } + }); + }); +}); +``` + +## Replay Testing + +```typescript +import { Worker } from '@temporalio/worker'; +import { Client, Connection } from '@temporalio/client'; +import fs from 'fs'; + +describe('Replay', () => { + it('replays workflow history from JSON file', async () => { + // Load history from a JSON file (exported from Web UI or Temporal CLI) + const filePath = './history_file.json'; + const history = JSON.parse(await fs.promises.readFile(filePath, 'utf8')); + + await Worker.runReplayHistory( + { + workflowsPath: require.resolve('./workflows'), + }, + history, + 'my-workflow-id' // Optional: provide workflowId if your workflow depends on it + ); + }); + + it('replays workflow history from server', async () => { + // Fetch history programmatically using the client + const connection = await Connection.connect({ address: 'localhost:7233' }); + 
const client = new Client({ connection, namespace: 'default' }); + const handle = client.workflow.getHandle('my-workflow-id'); + const history = await handle.fetchHistory(); + + await Worker.runReplayHistory( + { + workflowsPath: require.resolve('./workflows'), + }, + history, + 'my-workflow-id' + ); + }); +}); +``` + +## Activity Testing + +Test activities in isolation without running a workflow: + +```typescript +import { MockActivityEnvironment } from '@temporalio/testing'; +import { CancelledFailure } from '@temporalio/activity'; +import { myActivity } from './activities'; +import assert from 'assert'; + +describe('Activity tests', () => { + it('completes successfully', async () => { + const env = new MockActivityEnvironment(); + const result = await env.run(myActivity, 'input'); + assert.equal(result, 'expected output'); + }); + + it('handles cancellation', async () => { + const env = new MockActivityEnvironment(); + // Cancel the activity after a short delay + setTimeout(() => env.cancel(), 100); + try { + await env.run(longRunningActivity, 'input'); + assert.fail('Expected cancellation'); + } catch (err) { + assert(err instanceof CancelledFailure); + } + }); +}); +``` + +**Note:** `MockActivityEnvironment` provides `heartbeat()` and cancellation support for testing activity behavior. + +## Best Practices + +1. Use time-skipping for workflows with timers +2. Mock external dependencies in activities +3. Test replay compatibility when changing workflow code +4. Use unique workflow IDs per test +5. Clean up test environment after tests diff --git a/references/typescript/typescript.md b/references/typescript/typescript.md new file mode 100644 index 0000000..9918ee7 --- /dev/null +++ b/references/typescript/typescript.md @@ -0,0 +1,172 @@ +# Temporal TypeScript SDK Reference + +## Overview + +The Temporal TypeScript SDK provides a modern Promise-based approach to building durable workflows.
Workflows are bundled and run in an isolated runtime with automatic replacements for determinism protection. + +**CRITICAL**: All `@temporalio/*` packages must have the same version number. + +## Understanding Replay + +Temporal workflows are durable through history replay. For details on how this works, see `references/core/determinism.md`. + +## Quick Start + +**Add Dependencies:** Install the Temporal SDK packages (use the package manager appropriate for your project): +```bash +npm install @temporalio/client @temporalio/worker @temporalio/workflow @temporalio/activity +``` + +Note: if you are working in production, it is strongly advised to use ~ version constraints, i.e. `npm install ... --save-prefix='~'` if using npm. + +**activities.ts** - Activity definitions (separate file to distinguish workflow vs activity code): +```typescript +export async function greet(name: string): Promise<string> { + return `Hello, ${name}!`; +} +``` + +**workflows.ts** - Workflow definition (use type-only imports for activities): +```typescript +import { proxyActivities } from '@temporalio/workflow'; +import type * as activities from './activities'; + +const { greet } = proxyActivities<typeof activities>({ + startToCloseTimeout: '1 minute', +}); + +export async function greetingWorkflow(name: string): Promise<string> { + return await greet(name); +} +``` + +**worker.ts** - Worker setup (imports activities and workflows, runs indefinitely): +```typescript +import { Worker } from '@temporalio/worker'; +import * as activities from './activities'; + +async function run() { + const worker = await Worker.create({ + workflowsPath: require.resolve('./workflows'), // For production, use workflowBundle instead + activities, + taskQueue: 'greeting-queue', + }); + await worker.run(); +} + +run().catch(console.error); +``` + +**Start the dev server:** Start `temporal server start-dev` in the background. + +**Start the worker:** Run `npx ts-node worker.ts` in the background.

**client.ts** - Start a workflow execution:
```typescript
import { Client } from '@temporalio/client';
import { greetingWorkflow } from './workflows';
import { v4 as uuid } from 'uuid';

async function run() {
  const client = new Client();

  const result = await client.workflow.execute(greetingWorkflow, {
    workflowId: uuid(),
    taskQueue: 'greeting-queue',
    args: ['my name'],
  });

  console.log(`Result: ${result}`);
}

run().catch(console.error);
```

**Run the workflow:** Run `npx ts-node client.ts`. It should output: `Result: Hello, my name!`.

## Key Concepts

### Workflow Definition
- Async functions exported from the workflow file
- Use `proxyActivities()` with type-only imports
- Use `defineSignal()`, `defineQuery()`, `defineUpdate()`, and `setHandler()` for handlers

### Activity Definition
- Regular async functions
- Can perform I/O, network calls, etc.
- Use `heartbeat()` for long operations

### Worker Setup
- Use `Worker.create()` with `workflowsPath` (dev) or `workflowBundle` (production) - see `references/typescript/gotchas.md`
- Import activities directly (not via proxy)

## File Organization Best Practice

**Keep Workflow definitions in separate files from Activity definitions.** The TypeScript SDK bundles workflow files separately. Minimizing workflow file contents improves Worker startup time.

```
my_temporal_app/
├── workflows/
│   └── greeting.ts      # Only Workflow functions
├── activities/
│   └── translate.ts     # Only Activity functions
├── worker.ts            # Worker setup, imports both
└── client.ts            # Client code to start workflows
```

**In the Workflow file, use type-only imports for activities:**
```typescript
// workflows/greeting.ts
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from '../activities/translate';

const { translate } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});
```

## Determinism Rules

The TypeScript SDK runs workflows in an isolated V8 sandbox.

**Automatic replacements:**
- `Math.random()` → deterministic seeded PRNG
- `Date.now()` → workflow start time
- `setTimeout` → deterministic timer

**Safe to use:**
- `sleep()` from `@temporalio/workflow`
- `condition()` for waiting
- Standard JavaScript operations

See `references/typescript/determinism.md` for detailed rules.

## Common Pitfalls

1. **Importing activities without `type`** - Use `import type * as activities`
2. **Version mismatch** - All `@temporalio/*` packages must match
3. **Direct I/O in workflows** - Use activities for external calls
4. **Missing `proxyActivities`** - Required to call activities from workflows
5. **Forgetting to bundle workflows** - The Worker needs `workflowsPath` or `workflowBundle`
6. **Using `workflowsPath` in production** - Use `workflowBundle` for production (see `references/typescript/gotchas.md`)
7. **Forgetting to heartbeat** - Long-running activities need `heartbeat()` calls
8. **Logging in workflows** - For observability, use `import { log } from '@temporalio/workflow'` (routes through sinks). For temporary print debugging, `console.log()` is fine: it's direct and immediate, whereas `log` may lose messages on workflow errors.
9. **Forgetting to wait on activity calls** - Activity calls return Promises; you must eventually await them (directly or via `Promise.all()` for parallel execution)

## Writing Tests

See `references/typescript/testing.md` for info on writing tests.

## Additional Resources

### Reference Files
- **`references/typescript/patterns.md`** - Signals, queries, child workflows, saga pattern, etc.
- **`references/typescript/determinism.md`** - Essentials of determinism in TypeScript
- **`references/typescript/gotchas.md`** - TypeScript-specific mistakes and anti-patterns
- **`references/typescript/error-handling.md`** - ApplicationFailure, retry policies, non-retryable errors
- **`references/typescript/observability.md`** - Logging, metrics, tracing
- **`references/typescript/testing.md`** - TestWorkflowEnvironment, time-skipping, activity mocking
- **`references/typescript/advanced-features.md`** - Schedules, worker tuning, and more
- **`references/typescript/data-handling.md`** - Data converters, payload encryption, etc.
- **`references/typescript/versioning.md`** - Patching API, workflow type versioning, Worker Versioning
- **`references/typescript/determinism-protection.md`** - V8 sandbox and bundling

diff --git a/references/typescript/versioning.md b/references/typescript/versioning.md
new file mode 100644
index 0000000..a9f57a2
--- /dev/null
+++ b/references/typescript/versioning.md

# TypeScript SDK Versioning

For a conceptual overview and guidance on choosing an approach, see `references/core/versioning.md`.

## Patching API

The Patching API lets you change Workflow Definitions without causing non-deterministic behavior in running Workflows.

### The patched() Function

The `patched()` function takes a `patchId` string and returns a boolean:

```typescript
import { patched } from '@temporalio/workflow';

export async function myWorkflow(): Promise<void> {
  if (patched('my-change-id')) {
    // New code path
    await newImplementation();
  } else {
    // Old code path (for replay of existing executions)
    await oldImplementation();
  }
}
```

**How it works:**
- If the Workflow is running for the first time, `patched()` returns `true` and inserts a marker into the Event History
- During replay, if the history contains a marker with the same `patchId`, `patched()` returns `true`
- During replay, if no matching marker exists, `patched()` returns `false`

**TypeScript-specific behavior:** Unlike Python/.NET/Ruby, `patched()` is not memoized when it returns `false`. This means you can use `patched()` in loops. However, if a single patch requires coordinated behavioral changes at different points in your workflow, you may need to manually memoize the result:

```typescript
const useNewBehavior = patched('my-change');
// Use useNewBehavior at multiple points in the workflow
```

### Three-Step Patching Process

Patching is a three-step process for safely deploying changes.

**Warning:** Failing to follow this process correctly will result in non-determinism errors for in-flight workflows.

#### Step 1: Patch in New Code

Add the patch alongside the old code:

```typescript
import { patched, sleep } from '@temporalio/workflow';

// Original code sent fax notifications
export async function shippingConfirmation(): Promise<void> {
  if (patched('changedNotificationType')) {
    await sendEmail(); // New code
  } else {
    await sendFax(); // Old code for replay
  }
  await sleep('1 day');
}
```

#### Step 2: Deprecate the Patch

Once all Workflows using the old code have completed, deprecate the patch:

```typescript
import { deprecatePatch, sleep } from '@temporalio/workflow';

export async function shippingConfirmation(): Promise<void> {
  deprecatePatch('changedNotificationType');
  await sendEmail();
  await sleep('1 day');
}
```

The `deprecatePatch()` function marks the patch as deprecated: replay succeeds whether or not a history contains the marker, allowing a transition period.

#### Step 3: Remove the Patch

After all Workflows using `deprecatePatch` have completed, remove it entirely:

```typescript
import { sleep } from '@temporalio/workflow';

export async function shippingConfirmation(): Promise<void> {
  await sendEmail();
  await sleep('1 day');
}
```

### Query Filters for Versioned Workflows

Use List Filters to find Workflows by version:

```
# Find running Workflows with a specific patch
WorkflowType = "shippingConfirmation" AND ExecutionStatus = "Running" AND TemporalChangeVersion = "changedNotificationType"

# Find running Workflows without the patch (started before patching)
WorkflowType = "shippingConfirmation" AND ExecutionStatus = "Running" AND TemporalChangeVersion IS NULL
```

## Workflow Type Versioning

An alternative to patching is creating new Workflow functions for incompatible changes:

```typescript
// Original Workflow
export async function pizzaWorkflow(order: PizzaOrder): Promise<void> {
  // Original implementation
}

// New version with incompatible changes
export async function pizzaWorkflowV2(order: PizzaOrder): Promise<void> {
  // Updated implementation
}
```

Register both Workflows with the Worker:

```typescript
const worker = await Worker.create({
  workflowsPath: require.resolve('./workflows'), // Use workflowBundle for production
  taskQueue: 'pizza-queue',
});
```

Update client code to start new Workflows with the new type:

```typescript
// Start new executions with V2
await client.workflow.start(pizzaWorkflowV2, {
  workflowId: 'order-123',
  taskQueue: 'pizza-queue',
  args: [order],
});
```

Use List Filters to check for remaining V1 executions:

```
WorkflowType = "pizzaWorkflow" AND ExecutionStatus = "Running"
```

After all V1 executions complete, remove the old Workflow function.

## Worker Versioning

Worker Versioning allows multiple Worker versions to run simultaneously, routing Workflows to specific versions without code-level patching. Workflows are pinned to the Worker Deployment Version they started on.

> **Note:** Worker Versioning is currently in Public Preview. The legacy Worker Versioning API (before 2025) will be removed from Temporal Server in March 2026.

### Key Concepts

- **Worker Deployment**: A logical name for your application (e.g., "order-service")
- **Worker Deployment Version**: A specific build identified by deployment name + Build ID
- **Workflow Pinning**: Workflows complete on the Worker Deployment Version they started on

### Configuring Workers for Versioning

```typescript
import { Worker, NativeConnection } from '@temporalio/worker';

const worker = await Worker.create({
  workflowsPath: require.resolve('./workflows'), // Use workflowBundle for production
  taskQueue: 'my-queue',
  connection: await NativeConnection.connect({ address: 'temporal:7233' }),
  workerDeploymentOptions: {
    useWorkerVersioning: true,
    version: {
      deploymentName: 'order-service',
      buildId: '1.0.0', // Git hash, semver, build number, etc.
    },
  },
});
```

**Configuration options:**
- `useWorkerVersioning`: Enables Worker Versioning
- `version.deploymentName`: Logical name for your service (consistent across versions)
- `version.buildId`: Unique identifier for this build

### Deployment Workflow

1. Deploy a new Worker version with a new `buildId`
2. Use the Temporal CLI to set the new version as current:
   ```bash
   temporal worker deployment set-current-version \
     --deployment-name order-service \
     --build-id 2.0.0
   ```
3. New Workflows start on the new version
4. Existing Workflows continue on their original version until completion
5. Decommission old Workers once all their Workflows complete

### When to Use Worker Versioning

Worker Versioning is best suited for:
- **Short-running Workflows**: Old Workers only need to run briefly during deployment transitions
- **Frequent deployments**: Eliminates the need for code-level patching on every change
- **Blue-green deployments**: Run old and new versions simultaneously with traffic control

For long-running Workflows, consider combining Worker Versioning with the Patching API, or use Continue-as-New to move Workflows to newer versions.

## Best Practices

1. Use descriptive `patchId` names that explain the change
2. Follow the three-step patching process completely before removing patches
3. Use List Filters to verify no Workflows are still running before removing version support
4. Keep Worker Deployment names consistent across all versions
5. Use unique, traceable Build IDs (git hashes, semver, timestamps)
6. Test version transitions with replay tests before deploying