From d1363cbd7777a39338df564313a196f9ed821362 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Wed, 18 Mar 2026 17:14:56 +0000 Subject: [PATCH 1/7] first commit putting most of the information I want out there --- .../container/cpu-speech-to-text.mdx | 183 ++++++++++++++++++ 1 file changed, 183 insertions(+) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 60b63656..53aa8531 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -225,6 +225,189 @@ The following example shows how to use `--all-formats` parameter. In this scenar +## Batch persisted worker transcription + +Batch persisted workers (knows as http batch workers), are multi session capable persisted workers. They work utilizing an http server, which is able to +accept jobs through POST and by using the [V2 Batch REST API] (https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was build to mimic exactly the V2 API capabilities and the whole life cycle +of posting a job, to checking the status of the jobs and asking for the transcript. + + +You can run the persisted worker with: + + + {`docker run -it \\ + -e LICENSE_TOKEN=$TOKEN_VALUE \\ + -p PORT:18000 \\ + batch-asr-transcriber-en:${smVariables.latestContainerVersion} \\ + --run-mode http \\ + --parallel=4 \\ + --all-formats /output_dir_name +`} + + +The parameters are: +- `parallel` - The number of parallel sessions you want this container to have (Each session corresponds to one gpu connection). The more sessions the higher + throughput you should be able to get (until you max out your gpu capacity). (Might worth adding recommendations here? IDK). +- `all-formats` This is similar to [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats). 
+ If this is not provided the default path that all jobs and logs will be saved to is `/tmp/jobs` + +To submit a job you can either use curl directly or using the python sdk. +With curl: +``` + curl -X POST address.of.container:PORT/v2/jobs \ + -H 'X-SM-Processing-Data: {"parallel_engines":2, "user_id":"MY_USER_ID"}' \ + -F 'config={ + "type":"transcription", + "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"} + }' \ + -F 'data_file=@~/audio_file.mp3' +``` +Returns: + on success:json string containing job id: {"job_id": "abcdefgh01"} and HTTP status code 201 + on failure: technically it raises but the exception is translated to HTTP status code != 200: + HTTP status code 503 for server busy + HTTP status code 400 for invalid request + +with [python sdk](https://github.com/speechmatics/speechmatics-python-sdk?tab=readme-ov-file#batch-transcription): +``` +import asyncio +import os +from dotenv import load_dotenv +from speechmatics.batch import AsyncClient + +load_dotenv() + +async def main(): + client = AsyncClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), url="address.of.container:PORT/v2") + result = await client.transcribe("audio.wav",parallel_engines=2, user_id="MY_USER_ID") + print(result.transcript_text) + await client.close() + +asyncio.run(main()) +``` + +## Job specific endpoints + +/v2/jobs + +args: created_before: string in ISO 8601 format, only returns jobs created before this time +limit: maximum number of jobs to return, can be between 1 and 100 + +returns: list of jobs + +/v2/jobs/{job_id}/transcript + +args: job_id and format of the transcript. Options for the transcript currently are : "json", "txt", "srt" (we might need to add an ALL option). Maybe we can return all due to the nature of the http requests, but all formats are probably saved already locally?(todo find out) + +Returns the transcript for a specific job if it has finished, the format is a valid choice, and the job_id exists. 
+ +if the job_id doesn’t exist returns an HTTPException with 404. + +if the job hasn’t finished, returns a 404, and includes the status and request_id. + +if the format is not in our included list we return a 404 with error = unsupported format + + +/v2/jobs/{job_id} + +returns job status, including job_id and request_id + + +/v2/jobs/{job_id}/log + +returns the logs for the specific job + + +## Health service + +The container is able to expose an HTTP Health Service, which offers startup, liveness, readiness, and session listing probes. This is accessible from port 8001, and has four endpoints, `started`, `live`, `ready` and `session_status`. This may be especially helpful if you are deploying the container into a Kubernetes cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around +[liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). + +The Health Service is enabled by default and runs as a subprocess of the main entrypoint to the container. + +### Endpoints + +The Health Service offers four endpoints: + +#### `/sessions` + + + f"{js.request_id},{js.requested_parallel}" for js in self._jobs_status.values() if js.is_decoding + +Possible responses: + +- `200` if all of the services in the container have successfully started. + +A JSON object is also returned in the body of the response, indicating the status. + +Example: + +```bash-and-response +$ curl -i address.of.container:PORT/sessions +HTTP/1.0 200 OK +Server: BaseHTTP/0.6 Python/3.8.5 +Date: Mon, 08 Feb 2021 12:46:21 GMT +Content-Type: application/json +{ + "started": true +} +``` + +#### `/live` + +This endpoint provides a liveness probe. It can be queried using an HTTP GET request. You must include the relevant port, 8001, in the request. + +This probe indicates whether all services in the Container are active. 
The services in the Container send regular updates to the Health Service, if they don't send an update for more than 10 seconds then they will be marked as 'dead' and this endpoint will return an unsuccessful response code. For example, if the WebSocket server in the Container were to crash, this endpoint should indicate that. + +Possible responses: + +- `200` if all of the services in the Container have successfully started, and have recently sent an update to the Health Service. +- `503` otherwise. + +A JSON object is also returned in the body of the response, indicating the status. + +Example: + +```bash-and-response +$ curl -i address.of.container:PORT/live +HTTP/1.0 200 OK +Server: BaseHTTP/0.6 Python/3.8.5 +Date: Mon, 08 Feb 2021 12:46:45 GMT +Content-Type: application/json +{ + "live": true +} +``` + +#### `/ready` + +This endpoint provides a readiness probe. It can be queried using an HTTP GET request. + +The container has been designed to process multiple audio streams at a time. This probe indicates whether the container has a slot free for connections, and can be used as a scaling mechanism. + +**Note**: The readiness check is accurate within a 2 second resolution. If you do use this probe for load balancing, be aware that bursts of traffic within that 2 second window could all be allocated to a single Container since its readiness state will not change. +return {"ready": True, "engines_used": self.engines_used} +Possible responses: + +- `200` if the container has a free connection slot. +- `503` otherwise. + +In the body of the response there is also a JSON object with the current status. 
+ +Example: + +```bash-and-response +$ curl -i address.of.container:PORT/ready +HTTP/1.0 200 OK +Server: BaseHTTP/0.6 Python/3.8.5 +Date: Mon, 08 Feb 2021 12:47:05 GMT +Content-Type: application/json +{ + "ready": true, + "engines_used": 2 +} +``` + ## Realtime transcription The Realtime container provides the ability to transcribe speech data in a predefined language from a live stream or a recorded audio file. From a32e122a7db020a12407eaaff9b03ecb442051ad Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Wed, 18 Mar 2026 17:34:03 +0000 Subject: [PATCH 2/7] small fixes to be able to render --- docs/deployments/container/cpu-speech-to-text.mdx | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 53aa8531..f517b28d 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -263,10 +263,10 @@ With curl: -F 'data_file=@~/audio_file.mp3' ``` Returns: - on success:json string containing job id: {"job_id": "abcdefgh01"} and HTTP status code 201 - on failure: technically it raises but the exception is translated to HTTP status code != 200: - HTTP status code 503 for server busy - HTTP status code 400 for invalid request +on success: json string containing job id: `{"job_id": "abcdefgh01"}` and HTTP status code 201 +on failure: technically it raises but the exception is translated to HTTP status code != 200: + HTTP status code 503 for server busy + HTTP status code 400 for invalid request with [python sdk](https://github.com/speechmatics/speechmatics-python-sdk?tab=readme-ov-file#batch-transcription): ``` @@ -386,7 +386,7 @@ This endpoint provides a readiness probe. It can be queried using an HTTP GET re The container has been designed to process multiple audio streams at a time. 
This probe indicates whether the container has a slot free for connections, and can be used as a scaling mechanism. **Note**: The readiness check is accurate within a 2 second resolution. If you do use this probe for load balancing, be aware that bursts of traffic within that 2 second window could all be allocated to a single Container since its readiness state will not change. -return {"ready": True, "engines_used": self.engines_used} +return `{"ready": True, "engines_used": self.engines_used}` Possible responses: - `200` if the container has a free connection slot. From 54503716fc50961b08f9fda71867cdf4f902a7f5 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Wed, 18 Mar 2026 17:37:25 +0000 Subject: [PATCH 3/7] more fixes --- docs/deployments/container/cpu-speech-to-text.mdx | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index f517b28d..6a5085b4 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -295,7 +295,7 @@ limit: maximum number of jobs to return, can be between 1 and 100 returns: list of jobs -/v2/jobs/{job_id}/transcript +`/v2/jobs/{job_id}/transcript` args: job_id and format of the transcript. Options for the transcript currently are : "json", "txt", "srt" (we might need to add an ALL option). 
Maybe we can return all due to the nature of the http requests, but all formats are probably saved already locally?(todo find out) @@ -308,12 +308,12 @@ if the job hasn’t finished, returns a 404, and includes the status and request if the format is not in our included list we return a 404 with error = unsupported format -/v2/jobs/{job_id} +`/v2/jobs/{job_id}` returns job status, including job_id and request_id -/v2/jobs/{job_id}/log +`/v2/jobs/{job_id}/log` returns the logs for the specific job @@ -332,7 +332,9 @@ The Health Service offers four endpoints: #### `/sessions` - f"{js.request_id},{js.requested_parallel}" for js in self._jobs_status.values() if js.is_decoding +```python (TODO GH REMOVE) +f"{js.request_id},{js.requested_parallel}" for js in self._jobs_status.values() if js.is_decoding +``` Possible responses: From edadd98a0ebec11966a7353c9467e15459438aa4 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Wed, 18 Mar 2026 20:06:20 +0000 Subject: [PATCH 4/7] more improvements, adding actual results from the endpoints --- .../container/cpu-speech-to-text.mdx | 86 +++++++++++++++++-- 1 file changed, 77 insertions(+), 9 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 6a5085b4..ee87fde1 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -228,7 +228,7 @@ The following example shows how to use `--all-formats` parameter. In this scenar ## Batch persisted worker transcription Batch persisted workers (knows as http batch workers), are multi session capable persisted workers. They work utilizing an http server, which is able to -accept jobs through POST and by using the [V2 Batch REST API] (https://docs.speechmatics.com/api-ref/batch/create-a-new-job). 
This server was build to mimic exactly the V2 API capabilities and the whole life cycle +accept jobs through POST and by using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was build to mimic exactly the V2 API capabilities and the whole life cycle of posting a job, to checking the status of the jobs and asking for the transcript. @@ -286,14 +286,59 @@ async def main(): asyncio.run(main()) ``` -## Job specific endpoints +Regular lifecycle and how to set a job to use multiple engine, to reduce RTF -/v2/jobs +..add headers stuff .. + +explain what endpoint they need to check before starting a job. + +### Job specific endpoints + +`/v2/jobs` args: created_before: string in ISO 8601 format, only returns jobs created before this time limit: maximum number of jobs to return, can be between 1 and 100 returns: list of jobs +```json +{ + "jobs": [ + { + "id": "191f47e4a4204fa4ac2b", + "created_at": "2026-03-18T19:27:42.436Z", + "data_name": "5_min", + "text_name": null, + "duration": 300, + "status": "RUNNING", + "config": { + "type": "transcription", + "transcription_config": { + "language": "en", + "diarization": "speaker", + "operating_point": "standard" + } + } + }, + { + "id": "6dcb02e0dc5943e2b643", + "created_at": "2026-03-18T19:27:47.550Z", + "data_name": "5_min", + "text_name": null, + "duration": 300, + "status": "RUNNING", + "config": { + "type": "transcription", + "transcription_config": { + "language": "en", + "diarization": "speaker", + "operating_point": "standard" + } + } + } + ] +} +``` + `/v2/jobs/{job_id}/transcript` @@ -312,26 +357,46 @@ if the format is not in our included list we return a 404 with error = unsupport returns job status, including job_id and request_id +```json +{ + "job": { + "id": "191f47e4a4204fa4ac2b", + "created_at": "2026-03-18T19:27:42.436Z", + "data_name": "5_min", + "duration": 300, + "status": "DONE", + "config": { + "type": "transcription", + "transcription_config": { + 
"language": "en", + "diarization": "speaker", + "operating_point": "standard" + } + }, + "request_id": "191f47e4a4204fa4ac2b" + } +} +``` `/v2/jobs/{job_id}/log` returns the logs for the specific job -## Health service +### Health service -The container is able to expose an HTTP Health Service, which offers startup, liveness, readiness, and session listing probes. This is accessible from port 8001, and has four endpoints, `started`, `live`, `ready` and `session_status`. This may be especially helpful if you are deploying the container into a Kubernetes cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around +The container is able to expose an HTTP Health Service, which offers startup, liveness, readiness, and session listing probes. This is accessible from port 8001, and has four endpoints, `started`, `live`, `ready`.. This may be especially helpful if you are deploying the container into a Kubernetes cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around [liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). The Health Service is enabled by default and runs as a subprocess of the main entrypoint to the container. 
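Since the `/ready` body shown later in this section reports `engines_used`, a client can combine it with the `--parallel` value the worker was started with to decide whether a new job will fit. A minimal client-side sketch — the helper names here are illustrative, not part of the container or any SDK:

```python
import json

def free_engines(ready_body: str, parallel: int) -> int:
    """Free engine slots, given a /ready response body and the --parallel value
    the worker was started with. Assumes the documented body shape:
    {"ready": <bool>, "engines_used": <int>}."""
    status = json.loads(ready_body)
    return max(parallel - int(status.get("engines_used", 0)), 0)

def can_accept(ready_body: str, parallel: int, requested: int) -> bool:
    """True if a job asking for `requested` parallel engines should currently fit."""
    return free_engines(ready_body, parallel) >= requested
```

For example, with `--parallel=4` and a body of `{"ready": true, "engines_used": 2}`, two engines are free.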
-### Endpoints +#### Endpoints -The Health Service offers four endpoints: +The Health Service offers three endpoints: #### `/sessions` - +(TODO GH REMOVE) ```python (TODO GH REMOVE) f"{js.request_id},{js.requested_parallel}" for js in self._jobs_status.values() if js.is_decoding ``` @@ -351,7 +416,10 @@ Server: BaseHTTP/0.6 Python/3.8.5 Date: Mon, 08 Feb 2021 12:46:21 GMT Content-Type: application/json { - "started": true + "request_ids": [ + "978174b1564e40ccacba,2", + "52d532a2efcb4b78962b,2" + ] } ``` From 3689576cf85501175f1cfae1bdd8ca92be6bccb0 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 20 Mar 2026 16:29:25 +0000 Subject: [PATCH 5/7] more improvements and full cycle documented --- .../container/cpu-speech-to-text.mdx | 111 ++++++++++++------ 1 file changed, 74 insertions(+), 37 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index ee87fde1..cb4f8492 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -227,9 +227,13 @@ The following example shows how to use `--all-formats` parameter. In this scenar ## Batch persisted worker transcription -Batch persisted workers (knows as http batch workers), are multi session capable persisted workers. They work utilizing an http server, which is able to -accept jobs through POST and by using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was build to mimic exactly the V2 API capabilities and the whole life cycle -of posting a job, to checking the status of the jobs and asking for the transcript. +This feature is available for onPrem containers only. + +Shall we mention the version which this is available too????? + +Batch persisted workers (known as http batch workers), are batch multi session capable persisted workers. 
They run an HTTP server, which is able to
+accept batch jobs through POST using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mirror the V2 API capabilities and the whole life cycle
+of posting a job, checking its status and retrieving the transcript.


 You can run the persisted worker with:

@@ -249,9 +253,13 @@ The parameters are:
 - `parallel` - The number of parallel sessions you want this container to have (each session corresponds to one GPU connection). The more sessions, the higher
   the throughput you should be able to get (until you max out your GPU capacity).
 - `all-formats` This is similar to [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats).
-  If this is not provided the default path that all jobs and logs will be saved to is `/tmp/jobs`
+  If this is not provided, the default path that all jobs and logs are saved to is `/tmp/jobs`.
+- `PORT` The host port of your local environment that you forward to the container's port.
+
+Do we need to say that they can also set the internal port via an environment variable?
+`SM_BATCH_WORKER_LISTEN_PORT` → env var controlling the port the API listens on

-To submit a job you can either use curl directly or using the python sdk.
+To submit a job you can either use curl directly or use the python sdk.
With curl:
```
 curl -X POST address.of.container:PORT/v2/jobs \
 -H 'X-SM-Processing-Data: {"parallel_engines":2, "user_id":"MY_USER_ID"}' \
 -F 'config={
   "type":"transcription",
   "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
 }' \
 -F 'data_file=@~/audio_file.mp3'
```
+
 Returns:
+```
 on success: json string containing the job id: `{"job_id": "abcdefgh01"}` and HTTP status code 201
-on failure: technically it raises but the exception is translated to HTTP status code != 200:
+on failure: returns an HTTP status code != 200:
   HTTP status code 503 for server busy
   HTTP status code 400 for invalid request
+```

with the [python sdk](https://github.com/speechmatics/speechmatics-python-sdk?tab=readme-ov-file#batch-transcription):
```
import asyncio
import os
from dotenv import load_dotenv
from speechmatics.batch import AsyncClient

load_dotenv()

async def main():
    client = AsyncClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), url="address.of.container:PORT/v2")
    result = await client.transcribe("audio.wav", parallel_engines=2, user_id="MY_USER_ID")
    print(result.transcript_text)
    await client.close()

asyncio.run(main())
```

-Regular lifecycle and how to set a job to use multiple engine, to reduce RTF
+With the persisted batch worker you can submit multiple jobs to the same worker, provided it has enough free capacity to process them.
+You can determine the remaining free capacity by querying the `/ready` endpoint outlined below. The response includes (`engines_used`) the total number of engines currently
+used by running jobs. To calculate the number of free engines, subtract the engines currently in use from the number of parallel engines the worker was started
+with (set using `--parallel=NUM`).

-..add headers stuff ..
+If a job requests more engines than are currently free, the job won't be accepted and a 503 is returned with:

-explain what endpoint they need to check before starting a job.
+`HTTP 503: Service Unavailable - {"detail":"Server busy: 8 engines not available (2 engines in use, 5 parallel allowed)"}`
+
+By requesting more engines in parallel for a job, you can improve its turnaround time.
+
+To request multiple engines in parallel for a job, add a header called `X-SM-Processing-Data` to the POST request; its value is a JSON dictionary.
+To specify the number of parallel engines, add to this header a key `parallel_engines` whose value is the number of engines you want.
+
+For example, with curl:
+```
+ curl -X POST address.of.container:PORT/v2/jobs \
+ -H 'X-SM-Processing-Data: {"parallel_engines":2}' \
+ -F 'config={
+   "type":"transcription",
+   "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
+ }' \
+ -F 'data_file=@~/audio_file.mp3'
+```
+
+To enable the [Speaker identification](/speech-to-text/features/speaker-identification) feature, use the same `X-SM-Processing-Data` header
+and insert a key `user_id` whose value is the id of the user/customer.
+```
+ curl -X POST address.of.container:PORT/v2/jobs \
+ -H 'X-SM-Processing-Data: {"user_id":"MY_USER_ID"}' \
+ -F 'config={
+   "type":"transcription",
+   "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
+ }' \
+ -F 'data_file=@~/audio_file.mp3'
+```

-### Job specific endpoints
+### Job API endpoints

 `/v2/jobs`
@@ -315,7 +358,7 @@ returns: list of jobs
         "transcription_config": {
           "language": "en",
           "diarization": "speaker",
-          "operating_point": "standard"
+          "operating_point": "enhanced"
         }
       }
     },
@@ -331,7 +374,7 @@ returns: list of jobs
         "transcription_config": {
           "language": "en",
           "diarization": "speaker",
-          "operating_point": "standard"
+          "operating_point": "enhanced"
         }
       }
     }
@@ -342,7 +385,7 @@ returns: list of jobs
 `/v2/jobs/{job_id}/transcript`

-args: job_id and format of the transcript. Options for the transcript currently are : "json", "txt", "srt" (we might need to add an ALL option). Maybe we can return all due to the nature of the http requests, but all formats are probably saved already locally?(todo find out)
+args: job_id and format of the transcript. Options for the transcript format are currently: "json", "txt", "srt".

 Returns the transcript for a specific job if it has finished, the format is a valid choice, and the job_id exists.
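The retrieval step can be scripted with only the Python standard library. This is a sketch, not part of the product: it assumes the format is passed as a `format` query parameter (as in the cloud V2 API), so treat that parameter name as an assumption to verify against your container version:

```python
import json
import urllib.parse
import urllib.request

def transcript_url(base: str, job_id: str, fmt: str = "json") -> str:
    """Build the transcript URL; `fmt` must be one of the documented formats."""
    if fmt not in ("json", "txt", "srt"):
        raise ValueError(f"unsupported format: {fmt}")
    return f"{base}/v2/jobs/{urllib.parse.quote(job_id)}/transcript?format={fmt}"

def fetch_transcript(base: str, job_id: str, fmt: str = "json"):
    """GET the transcript; raises urllib.error.HTTPError on 404 (unknown job_id,
    job not finished, or unsupported format, as described in this section)."""
    with urllib.request.urlopen(transcript_url(base, job_id, fmt)) as resp:
        body = resp.read().decode()
    return json.loads(body) if fmt == "json" else body
```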
If the job_id doesn’t exist, an HTTPException with a 404 status is returned.

If the job hasn’t finished, a 404 is returned, including the status and request_id.

-if the format is not in our included list we return a 404 with error = unsupported format
+If the format is not in the supported list, a 404 is returned with error = unsupported format.

 `/v2/jobs/{job_id}`

 Returns the job status, including job_id and request_id:

         "transcription_config": {
           "language": "en",
           "diarization": "speaker",
-          "operating_point": "standard"
+          "operating_point": "enhanced"
         }
       },
       "request_id": "191f47e4a4204fa4ac2b"
   }
}
```

 `/v2/jobs/{job_id}/log`

 Returns the logs for the specific job.

 ### Health service

-The container is able to expose an HTTP Health Service, which offers startup, liveness, readiness, and session listing probes. This is accessible from port 8001, and has four endpoints, `started`, `live`, `ready`.. This may be especially helpful if you are deploying the container into a Kubernetes cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around
+The container exposes an HTTP Health Service, which offers liveness, readiness, and session listing probes. This is accessible from the same port
+as job posting, and has three endpoints, `live`, `ready` and `sessions`. This may be especially helpful if you are deploying the container into a Kubernetes
+cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around
 [liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).

-The Health Service is enabled by default and runs as a subprocess of the main entrypoint to the container.
-
 #### Endpoints

 The Health Service offers three endpoints:

 #### `/sessions`

-(TODO GH REMOVE)
-```python (TODO GH REMOVE)
-f"{js.request_id},{js.requested_parallel}" for js in self._jobs_status.values() if js.is_decoding
-```
-
-Possible responses:
-
-- `200` if all of the services in the container have successfully started.
-
-A JSON object is also returned in the body of the response, indicating the status.
+This endpoint provides a list of the currently running jobs. It can be queried using an HTTP GET request.
+Returns the list of currently running jobs; each entry is a comma-separated `request_id,parallel_engines` pair for that job.

 Example:

```bash-and-response
$ curl -i address.of.container:PORT/sessions
-HTTP/1.0 200 OK
+HTTP/1.1 200 OK
Server: BaseHTTP/0.6 Python/3.8.5
Date: Mon, 08 Feb 2021 12:46:21 GMT
Content-Type: application/json
{
  "request_ids": [
    "978174b1564e40ccacba,2",
    "52d532a2efcb4b78962b,2"
  ]
}
```

#### `/live`

This endpoint provides a liveness probe. It can be queried using an HTTP GET request.

-This probe indicates whether all services in the Container are active. The services in the Container send regular updates to the Health Service, if they don't send an update for more than 10 seconds then they will be marked as 'dead' and this endpoint will return an unsuccessful response code. For example, if the WebSocket server in the Container were to crash, this endpoint should indicate that.
+This probe indicates whether all services in the Container are active.

 Possible responses:

 - `200` if all of the services in the Container have successfully started, and have recently sent an update to the Health Service.
-- `503` otherwise.

 A JSON object is also returned in the body of the response, indicating the status.
Example:

```bash-and-response
$ curl -i address.of.container:PORT/live
-HTTP/1.0 200 OK
+HTTP/1.1 200 OK
Server: BaseHTTP/0.6 Python/3.8.5
Date: Mon, 08 Feb 2021 12:46:45 GMT
Content-Type: application/json
{
  "live": true
}
```

#### `/ready`

This endpoint provides a readiness probe. It can be queried using an HTTP GET request.

-The container has been designed to process multiple audio streams at a time. This probe indicates whether the container has a slot free for connections, and can be used as a scaling mechanism.
+The container has been designed to process multiple jobs concurrently. This probe indicates whether the container has one slot (one engine) free for connections, and can be used as a scaling mechanism.

-**Note**: The readiness check is accurate within a 2 second resolution. If you do use this probe for load balancing, be aware that bursts of traffic within that 2 second window could all be allocated to a single Container since its readiness state will not change.
-return `{"ready": True, "engines_used": self.engines_used}`
 Possible responses:

 - `200` if the container has a free connection slot.
 - `503` otherwise.

-In the body of the response there is also a JSON object with the current status.
+In the body of the response there is also a JSON object with the current status, and the total number of engines being used.

 Example:

```bash-and-response
$ curl -i address.of.container:PORT/ready
-HTTP/1.0 200 OK
+HTTP/1.1 200 OK
Server: BaseHTTP/0.6 Python/3.8.5
Date: Mon, 08 Feb 2021 12:47:05 GMT
Content-Type: application/json
{
  "ready": true,
  "engines_used": 2
}
```

+Environment variables:
+
+`SM_BATCH_WORKER_MAX_JOB_HISTORY`: the maximum number of job records to keep in memory.

 ## Realtime transcription

 The Realtime container provides the ability to transcribe speech data in a predefined language from a live stream or a recorded audio file.
From 779a4578cd88dcbf2977daca677a6d2b4adba4d3 Mon Sep 17 00:00:00 2001
From: Georgios Hadjiharalambous
Date: Fri, 20 Mar 2026 16:31:10 +0000
Subject: [PATCH 6/7] add bold to unknown if we should include

---
 docs/deployments/container/cpu-speech-to-text.mdx | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx
index cb4f8492..6722ebb7 100644
--- a/docs/deployments/container/cpu-speech-to-text.mdx
+++ b/docs/deployments/container/cpu-speech-to-text.mdx
@@ -227,9 +227,9 @@ The following example shows how to use `--all-formats` parameter. In this scenar
 ## Batch persisted worker transcription

-This feature is available for onPrem containers only.
+**This feature is available for on-prem containers only.**

-Shall we mention the version which this is available too?????
+**Should we also mention the version in which this becomes available?**

 Batch persisted workers (known as HTTP batch workers) are multi-session-capable persisted batch workers. They run an HTTP server, which is able to accept batch jobs through POST using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mirror the V2 API capabilities and the whole life cycle of posting a job, checking its status and retrieving the transcript.
@@ -256,8 +256,8 @@ The parameters are:
   If this is not provided, the default path that all jobs and logs are saved to is `/tmp/jobs`.
 - `PORT` The host port of your local environment that you forward to the container's port.

-Do we need to say that they can also set the internal port via an environment variable?
-`SM_BATCH_WORKER_LISTEN_PORT` → env var controlling the port the API listens on
+**Do we need to say that they can also set the internal port via an environment variable?
+`SM_BATCH_WORKER_LISTEN_PORT` → env var controlling the port the API listens on**

 To submit a job you can either use curl directly or use the python sdk.
With curl:

From 9666e13964eb9735ef304a487ba44627ecb45701 Mon Sep 17 00:00:00 2001
From: Georgios Hadjiharalambous
Date: Fri, 20 Mar 2026 16:37:49 +0000
Subject: [PATCH 7/7] add more comments why to use batch worker

---
 docs/deployments/container/cpu-speech-to-text.mdx | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx
index 6722ebb7..14cd28ef 100644
--- a/docs/deployments/container/cpu-speech-to-text.mdx
+++ b/docs/deployments/container/cpu-speech-to-text.mdx
@@ -235,7 +235,12 @@ Batch persisted workers (known as http batch workers), are batch multi session c
 accept batch jobs through POST using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mirror the V2 API capabilities and the whole life cycle
 of posting a job, checking its status and retrieving the transcript.

+The main benefit of this worker over the normal batch container is that you don't incur the cost of spinning up a worker for each file you want to transcribe.
+This reduces the turnaround time, especially for smaller files. Memory utilization is also reduced, since multiple jobs can run in parallel in the same container and share its memory,
+removing the need to spin up multiple containers that each incur the same memory cost.
+It also makes better use of the GPU: there is no per-job worker start-up time, so the GPU can be used uninterrupted.
+
+### How to run the worker and submit jobs to it

 You can run the persisted worker with: