</CodeBlock>


## Batch persisted worker transcription

**This feature is available for onPrem containers only.**

Batch persisted workers (also known as HTTP batch workers) are persisted workers capable of running multiple batch sessions at once. They run an HTTP server that accepts batch jobs via POST using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mirror the V2 API exactly, covering the whole life cycle of posting a job, checking the status of jobs, and retrieving the transcript.

The main benefit of this worker over a normal batch worker is that you don't incur the cost of spinning up a worker for each file you want to transcribe. This reduces turnaround time, especially for smaller files. Memory utilization is also reduced, because multiple jobs can now run in parallel in the same container and share memory, removing the need to spin up multiple containers that each incur the same memory cost. GPU utilization improves as well: there is no per-worker initial setup time, so the GPU can be used uninterrupted.

### How to run the worker and submit jobs to it
You can run the persisted worker with:

<CodeBlock language="bash">
{`docker run -it \\
-e LICENSE_TOKEN=$TOKEN_VALUE \\
-p PORT:18000 \\
batch-asr-transcriber-en:${smVariables.latestContainerVersion} \\
--run-mode http \\
--parallel=4 \\
--all-formats /output_dir_name
`}
</CodeBlock>

The parameters are:
- `parallel` - The number of parallel sessions you want this container to have (each session corresponds to one GPU connection). More sessions give higher throughput, until you max out your GPU's capacity.
- `all-formats` - This is similar to [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats).
  If this is not provided, all jobs and logs are saved to the default path `/tmp/jobs`.
- `PORT` - The port of your local environment that you forward to the Docker container's port.

You can change the internal port the API listens on with the `SM_BATCH_WORKER_LISTEN_PORT` environment variable.

To submit a job you can either use curl directly or use the Python SDK.
With curl:
```bash
curl -X POST address.of.container:PORT/v2/jobs \
-H 'X-SM-Processing-Data: {"parallel_engines":2, "user_id":"MY_USER_ID"}' \
-F 'config={
"type":"transcription",
"transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
}' \
-F 'data_file=@~/audio_file.mp3'
```

Returns:
- On success: HTTP status code 201 and a JSON string containing the job id, e.g. `{"job_id": "abcdefgh01"}`.
- On failure: a non-2xx HTTP status code:
  - 503 if the server is busy
  - 400 for an invalid request
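A client can branch on these status codes as follows. This is an illustrative sketch, not part of the API: the `handle_submit_response` helper is hypothetical, and `status`/`body` stand in for the values your HTTP client returns.

```python
import json

# Hypothetical helper: interpret the worker's response to a job submission,
# per the status codes documented above.
def handle_submit_response(status: int, body: str) -> str:
    if status == 201:
        # Success: the body is a JSON string such as {"job_id": "abcdefgh01"}
        return json.loads(body)["job_id"]
    if status == 503:
        raise RuntimeError("server busy - retry when engines free up")
    if status == 400:
        raise ValueError("invalid request")
    raise RuntimeError(f"unexpected status {status}")

print(handle_submit_response(201, '{"job_id": "abcdefgh01"}'))  # → abcdefgh01
```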

With the [python sdk](https://github.com/speechmatics/speechmatics-python-sdk?tab=readme-ov-file#batch-transcription):
```python
import asyncio
import os

from dotenv import load_dotenv
from speechmatics.batch import AsyncClient

load_dotenv()


async def main():
    client = AsyncClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), url="address.of.container:PORT/v2")
    result = await client.transcribe("audio.wav", parallel_engines=2, user_id="MY_USER_ID")
    print(result.transcript_text)
    await client.close()


asyncio.run(main())
```

With the persisted batch worker you can submit multiple jobs to the same worker, as long as it has enough free capacity to process them.
You can determine the remaining free capacity by querying the `/ready` endpoint outlined below. Its response includes the total number of engines
currently in use by running jobs (`engines_used`). The number of free engines is the number of parallel engines the worker was started with
(set using `--parallel=NUM`) minus the engines currently in use.
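The free-engine arithmetic can be sketched as follows (a minimal illustration; the `free_engines` helper is hypothetical, not part of the API):

```python
# Hypothetical helper illustrating the free-engine arithmetic:
# free = engines the worker was started with (--parallel=NUM)
#        minus engines currently in use (engines_used from /ready).
def free_engines(parallel: int, engines_used: int) -> int:
    return parallel - engines_used

# A worker started with --parallel=5 whose /ready reports engines_used=2
# has 3 engines free.
print(free_engines(5, 2))  # → 3
```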

If a job requests more engines than are currently free, the job won't be accepted and a 503 is returned:

`HTTP 503: Service Unavailable - {"detail":"Server busy: 8 engines not available (2 engines in use, 5 parallel allowed)"}`


By requesting more engines in parallel for a job, you can improve the job's turnaround time.

To request multiple engines in parallel for a job, add a header called `X-SM-Processing-Data` to the POST request. The header takes a JSON dictionary as its value.
To specify the number of parallel engines, include in this dictionary the key `parallel_engines` with the number of engines you want as its value.

For example with curl:
```bash
curl -X POST address.of.container:PORT/v2/jobs \
-H 'X-SM-Processing-Data: {"parallel_engines":2}' \
-F 'config={
"type":"transcription",
"transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
}' \
-F 'data_file=@~/audio_file.mp3'
```

To enable the [Speaker identification](/speech-to-text/features/speaker-identification) feature, use the same `X-SM-Processing-Data` header as above
and insert the key `user_id` with the id of the user/customer as its value.
```bash
curl -X POST address.of.container:PORT/v2/jobs \
-H 'X-SM-Processing-Data: {"user_id":"MY_USER_ID"}' \
-F 'config={
"type":"transcription",
"transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
}' \
-F 'data_file=@~/audio_file.mp3'
```

### Job API endpoints

`/v2/jobs`

Args:
- `created_before` - string in ISO 8601 format; only returns jobs created before this time
- `limit` - maximum number of jobs to return, between 1 and 100

Returns a list of jobs:
```json
{
"jobs": [
{
"id": "191f47e4a4204fa4ac2b",
"created_at": "2026-03-18T19:27:42.436Z",
"data_name": "5_min",
"text_name": null,
"duration": 300,
"status": "RUNNING",
"config": {
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker",
"operating_point": "enhanced"
}
}
},
{
"id": "6dcb02e0dc5943e2b643",
"created_at": "2026-03-18T19:27:47.550Z",
"data_name": "5_min",
"text_name": null,
"duration": 300,
"status": "RUNNING",
"config": {
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker",
"operating_point": "enhanced"
}
}
}
]
}
```
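As an illustration, the listing query parameters above can be assembled client-side like this (a sketch; the `jobs_list_url` helper and the host are placeholders, not part of the API):

```python
from urllib.parse import urlencode

# Hypothetical helper: build a URL for listing jobs, using the query
# parameters documented above (created_before, limit).
def jobs_list_url(base, created_before=None, limit=None):
    params = {}
    if created_before is not None:
        params["created_before"] = created_before  # ISO 8601 timestamp
    if limit is not None:
        if not 1 <= limit <= 100:
            raise ValueError("limit must be between 1 and 100")
        params["limit"] = limit
    query = urlencode(params)
    return f"{base}/v2/jobs" + (f"?{query}" if query else "")

print(jobs_list_url("http://address.of.container:8000", limit=10))
```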


`/v2/jobs/{job_id}/transcript`

Args: `job_id` and the format of the transcript. The currently supported formats are `json`, `txt`, and `srt`.

Returns the transcript for a specific job if the job has finished, the format is a valid choice, and the job_id exists.

- If the job_id doesn't exist, returns an HTTPException with status 404.
- If the job hasn't finished, returns a 404 that includes the status and request_id.
- If the format is not in the supported list, returns a 404 with error = unsupported format.
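A client-side sketch of the transcript lookup rules above (the `transcript_url` helper is hypothetical, and the `format` query-parameter name is an assumption; the supported formats come from the list above):

```python
SUPPORTED_FORMATS = {"json", "txt", "srt"}  # from the docs above

# Hypothetical helper: build the transcript URL for a job. The server
# returns 404 for unsupported formats, so checking client-side avoids
# a wasted round trip.
def transcript_url(base, job_id, fmt="json"):
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{base}/v2/jobs/{job_id}/transcript?format={fmt}"

print(transcript_url("http://address.of.container:8000", "191f47e4a4204fa4ac2b", "txt"))
```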


`/v2/jobs/{job_id}`

Returns the job status, including the `job_id` and `request_id`:

```json
{
"job": {
"id": "191f47e4a4204fa4ac2b",
"created_at": "2026-03-18T19:27:42.436Z",
"data_name": "5_min",
"duration": 300,
"status": "DONE",
"config": {
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker",
"operating_point": "enhanced"
}
},
"request_id": "191f47e4a4204fa4ac2b"
}
}
```

`/v2/jobs/{job_id}/log`

Returns the logs for the specific job.


### Health service

The container exposes an HTTP Health Service, which offers liveness, readiness, and session-listing probes. It is accessible on the same port
as job posting, and has three endpoints: `live`, `ready` and `sessions`. This may be especially helpful if you are deploying the container into a Kubernetes
cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around
[liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).

#### Endpoints

The Health Service offers three endpoints:

#### `/sessions`

This endpoint provides a list of the currently running jobs. It can be queried using an HTTP GET request.
It returns a list of the currently running jobs; each entry is a comma-separated string pairing a job's request_id with the number of parallel engines used for that job.

Example:

```bash-and-response
$ curl -i address.of.container:PORT/sessions
HTTP/1.1 200 OK
Server: BaseHTTP/0.6 Python/3.8.5
Date: Mon, 08 Feb 2021 12:46:21 GMT
Content-Type: application/json
{
"request_ids": [
"978174b1564e40ccacba,2",
"52d532a2efcb4b78962b,2"
]
}
```
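The `request_ids` entries above can be unpacked like this (a sketch based on the comma-separated `request_id,parallel_engines` format described above; the `parse_sessions` helper is hypothetical):

```python
# Each /sessions entry is "request_id,parallel_engines"; split it into a
# (request_id, engines) pair and total the engines currently in use.
def parse_sessions(request_ids):
    pairs = []
    for entry in request_ids:
        request_id, engines = entry.rsplit(",", 1)
        pairs.append((request_id, int(engines)))
    return pairs

sessions = parse_sessions(["978174b1564e40ccacba,2", "52d532a2efcb4b78962b,2"])
print(sessions)
print(sum(engines for _, engines in sessions))  # → 4 engines in use
```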

#### `/live`

This endpoint provides a liveness probe. It can be queried using an HTTP GET request.

This probe indicates whether all services in the Container are active.

Possible responses:

- `200` if all of the services in the Container have successfully started, and have recently sent an update to the Health Service.

A JSON object is also returned in the body of the response, indicating the status.

Example:

```bash-and-response
$ curl -i address.of.container:PORT/live
HTTP/1.1 200 OK
Server: BaseHTTP/0.6 Python/3.8.5
Date: Mon, 08 Feb 2021 12:46:45 GMT
Content-Type: application/json
{
"live": true
}
```

#### `/ready`

This endpoint provides a readiness probe. It can be queried using an HTTP GET request.

The container has been designed to process multiple jobs concurrently. This probe indicates whether the container has at least one slot (one engine) free for connections, and can be used as a scaling mechanism.

Possible responses:

- `200` if the container has a free connection slot.
- `503` otherwise.

The body of the response also contains a JSON object with the current status and the total number of engines being used.

Example:

```bash-and-response
$ curl -i address.of.container:PORT/ready
HTTP/1.1 200 OK
Server: BaseHTTP/0.6 Python/3.8.5
Date: Mon, 08 Feb 2021 12:47:05 GMT
Content-Type: application/json
{
"ready": true,
"engines_used": 2
}
```
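A scaling decision based on the `/ready` probe can be sketched as follows. This is illustrative only: the `should_scale_up` helper and the 80% threshold are assumptions, not part of the container's API.

```python
# Hypothetical scaling check: 503 from /ready means no free slot at all;
# otherwise scale up once engine utilization crosses an (illustrative)
# threshold, using the engines_used field from the response body.
def should_scale_up(status_code: int, engines_used: int, parallel: int) -> bool:
    if status_code == 503:
        return True
    return engines_used / parallel >= 0.8  # illustrative threshold

print(should_scale_up(200, 2, 5))  # → False: plenty of headroom
print(should_scale_up(503, 5, 5))  # → True: no free slot
```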

### Environment variables

- `SM_BATCH_WORKER_MAX_JOB_HISTORY` - The maximum number of job records to keep in memory.

## Realtime transcription

The Realtime container provides the ability to transcribe speech data in a predefined language from a live stream or a recorded audio file.