diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx
index 60b63656..14cd28ef 100644
--- a/docs/deployments/container/cpu-speech-to-text.mdx
+++ b/docs/deployments/container/cpu-speech-to-text.mdx
@@ -225,6 +225,301 @@ The following example shows how to use `--all-formats` parameter. In this scenar
+## Batch persisted worker transcription
+
+**This feature is available for onPrem containers only.**
+
+Batch persisted workers (also known as HTTP batch workers) are persisted workers that can run multiple batch sessions at once. They work by running an HTTP server that
+accepts batch jobs through POST requests using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mirror the V2 API exactly, covering the whole life cycle
+of posting a job, checking the status of jobs, and retrieving the transcript.
+
+The main benefit of this worker over the standard batch worker is that you don't incur the cost of spinning up a worker for each file you want to transcribe.
+This reduces the turnaround time, especially for smaller files. Memory utilization is also lower, because multiple jobs can run in parallel in the same container and share memory,
+removing the need to spin up multiple containers that each incur the same memory cost. The GPU is better utilized too: with no per-job worker setup time, the GPU can be used without interruption.
+
+### How to run the worker and submit jobs to it
+You can run the persisted worker with:
+
+
+  {`docker run -it \\
+    -e LICENSE_TOKEN=$TOKEN_VALUE \\
+    -p PORT:18000 \\
+    batch-asr-transcriber-en:${smVariables.latestContainerVersion} \\
+    --run-mode http \\
+    --parallel=4 \\
+    --all-formats /output_dir_name
+`}
+
+
+The parameters are:
+- `parallel` - The number of parallel sessions you want this container to run (each session corresponds to one GPU connection). The more sessions, the higher the
+  throughput you should be able to achieve, until you max out your GPU capacity.
+- `all-formats` - This works in the same way as [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats).
+  If this is not provided, the default path that all jobs and logs are saved to is `/tmp/jobs`.
+- `PORT` - The port on your local environment that you forward to the Docker container's port.
+
+You can also configure the internal port the API listens on by setting the `SM_BATCH_WORKER_LISTEN_PORT` environment variable.
+
+To submit a job you can use either curl directly or the Python SDK.
+
+With curl:
+```
+  curl -X POST address.of.container:PORT/v2/jobs \
+  -H 'X-SM-Processing-Data: {"parallel_engines":2, "user_id":"MY_USER_ID"}' \
+  -F 'config={
+      "type":"transcription",
+      "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
+      }' \
+  -F 'data_file=@~/audio_file.mp3'
+```
+
+Returns:
+```
+on success: a JSON string containing the job id, e.g. `{"job_id": "abcdefgh01"}`, with HTTP status code 201
+on failure: an error HTTP status code:
+    HTTP status code 503 if the server is busy
+    HTTP status code 400 for an invalid request
+```
+
+With the [Python SDK](https://github.com/speechmatics/speechmatics-python-sdk?tab=readme-ov-file#batch-transcription):
+```
+import asyncio
+import os
+from dotenv import load_dotenv
+from speechmatics.batch import AsyncClient
+
+load_dotenv()
+
+async def main():
+    client = AsyncClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), url="address.of.container:PORT/v2")
+    result = await client.transcribe("audio.wav", parallel_engines=2, user_id="MY_USER_ID")
+    print(result.transcript_text)
+    await client.close()
+
+asyncio.run(main())
+```
+
+With the persisted batch worker you can submit multiple jobs to the same worker, provided it has enough free capacity to process them.
+You can determine the remaining capacity by querying the `/ready` endpoint outlined below. The response includes `engines_used`, the total number of engines
+currently in use by running jobs. To calculate the number of free engines, subtract `engines_used` from the number of parallel engines the worker was
+started with (set using `--parallel=NUM`).
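As a rough illustration of the capacity calculation above, here is a short Python sketch that queries `/ready` and computes the free engine count. The helper names (`free_engines`, `query_free_engines`) and the `base_url` value are placeholders for this example, not part of the product:

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen


def free_engines(ready_body: dict, parallel: int) -> int:
    # Free engines = engines the worker was started with (--parallel=NUM)
    # minus the engines currently in use, as reported by /ready.
    return parallel - ready_body.get("engines_used", 0)


def query_free_engines(base_url: str, parallel: int) -> int:
    # GET /ready; urlopen raises HTTPError on the 503 "busy" response.
    # This sketch assumes the 503 body also carries the JSON status.
    try:
        with urlopen(f"{base_url}/ready") as resp:
            body = json.load(resp)
    except HTTPError as err:
        body = json.load(err)
    return free_engines(body, parallel)


# Example, using the /ready response body format shown on this page:
print(free_engines({"ready": True, "engines_used": 2}, 5))  # → 3
```

For instance, a worker started with `--parallel=5` that reports `"engines_used": 2` has 3 free engines, so a job requesting `"parallel_engines":3` would still be accepted.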
+
+If a job requests more engines than are currently free, the job won't be accepted and the worker returns a 503 with:
+
+`HTTP 503: Service Unavailable - {"detail":"Server busy: 8 engines not available (2 engines in use, 5 parallel allowed)"}`
+
+
+By requesting more engines in parallel for a job, you can improve that job's turnaround time.
+
+To request multiple parallel engines for a job, add a header called `X-SM-Processing-Data` to the POST request; its value is a JSON dictionary.
+To specify the number of parallel engines you want, add the key `parallel_engines` to this dictionary with the number of engines as its value.
+
+For example, with curl:
+```
+  curl -X POST address.of.container:PORT/v2/jobs \
+  -H 'X-SM-Processing-Data: {"parallel_engines":2}' \
+  -F 'config={
+      "type":"transcription",
+      "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
+      }' \
+  -F 'data_file=@~/audio_file.mp3'
+```
+
+To enable the [Speaker identification](/speech-to-text/features/speaker-identification) feature, use the same `X-SM-Processing-Data` header as above
+and add the key `user_id` with the id of the user/customer as its value.
+```
+  curl -X POST address.of.container:PORT/v2/jobs \
+  -H 'X-SM-Processing-Data: {"user_id":"MY_USER_ID"}' \
+  -F 'config={
+      "type":"transcription",
+      "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
+      }' \
+  -F 'data_file=@~/audio_file.mp3'
+```
+
+### Job API endpoints
+
+`/v2/jobs`
+
+args:
+- `created_before` - string in ISO 8601 format; only returns jobs created before this time
+- `limit` - maximum number of jobs to return; can be between 1 and 100
+
+returns: a list of jobs
+```json
+{
+  "jobs": [
+    {
+      "id": "191f47e4a4204fa4ac2b",
+      "created_at": "2026-03-18T19:27:42.436Z",
+      "data_name": "5_min",
+      "text_name": null,
+      "duration": 300,
+      "status": "RUNNING",
+      "config": {
+        "type": "transcription",
+        "transcription_config": {
+          "language": "en",
+          "diarization": "speaker",
+          "operating_point": "enhanced"
+        }
+      }
+    },
+    {
+      "id": "6dcb02e0dc5943e2b643",
+      "created_at": "2026-03-18T19:27:47.550Z",
+      "data_name": "5_min",
+      "text_name": null,
+      "duration": 300,
+      "status": "RUNNING",
+      "config": {
+        "type": "transcription",
+        "transcription_config": {
+          "language": "en",
+          "diarization": "speaker",
+          "operating_point": "enhanced"
+        }
+      }
+    }
+  ]
+}
+```
+
+
+`/v2/jobs/{job_id}/transcript`
+
+args: the `job_id` and the `format` of the transcript. The supported transcript formats are currently: `json`, `txt`, `srt`.
+
+Returns the transcript for a specific job if the job has finished, the format is a valid choice, and the `job_id` exists.
+
+If the `job_id` doesn't exist, returns an HTTP 404.
+
+If the job hasn't finished, returns a 404 that includes the job status and `request_id`.
+
+If the format is not in the supported list, returns a 404 with an "unsupported format" error.
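To put the job endpoints in this section together, here is a hedged Python sketch that polls a job's status until it is `DONE` and then fetches its transcript. It assumes the transcript format is passed as a `format` query parameter; the helper names, `base_url`, and `poll_interval` are illustrative only:

```python
import json
import time
from urllib.request import urlopen

# Formats listed for /v2/jobs/{job_id}/transcript on this page.
SUPPORTED_FORMATS = {"json", "txt", "srt"}


def transcript_url(base_url: str, job_id: str, fmt: str = "json") -> str:
    # Build the transcript URL, rejecting unsupported formats client-side
    # so we never trigger the server's "unsupported format" 404.
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return f"{base_url}/v2/jobs/{job_id}/transcript?format={fmt}"


def wait_for_transcript(base_url: str, job_id: str, fmt: str = "txt",
                        poll_interval: float = 5.0) -> bytes:
    # Poll GET /v2/jobs/{job_id} until the job status is DONE,
    # then GET the transcript in the requested format.
    while True:
        with urlopen(f"{base_url}/v2/jobs/{job_id}") as resp:
            status = json.load(resp)["job"]["status"]
        if status == "DONE":
            break
        time.sleep(poll_interval)
    with urlopen(transcript_url(base_url, job_id, fmt)) as resp:
        return resp.read()
```

Polling client-side like this avoids repeatedly hitting the transcript endpoint's 404-while-running behaviour; only the final request fetches the transcript itself.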
+
+
+`/v2/jobs/{job_id}`
+
+Returns the job status, including the `job_id` and `request_id`:
+
+```json
+{
+  "job": {
+    "id": "191f47e4a4204fa4ac2b",
+    "created_at": "2026-03-18T19:27:42.436Z",
+    "data_name": "5_min",
+    "duration": 300,
+    "status": "DONE",
+    "config": {
+      "type": "transcription",
+      "transcription_config": {
+        "language": "en",
+        "diarization": "speaker",
+        "operating_point": "enhanced"
+      }
+    },
+    "request_id": "191f47e4a4204fa4ac2b"
+  }
+}
+```
+
+`/v2/jobs/{job_id}/log`
+
+Returns the logs for the specified job.
+
+
+### Health service
+
+The container exposes an HTTP Health Service, which offers liveness, readiness, and session listing probes. It is accessible on the same port
+as job posting and has three endpoints: `live`, `ready`, and `sessions`. This may be especially helpful if you are deploying the container into a Kubernetes
+cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation on
+[liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).
+
+#### Endpoints
+
+The Health Service offers three endpoints:
+
+#### `/sessions`
+
+This endpoint provides a list of the currently running jobs. It can be queried using an HTTP GET request.
+It returns a list of the currently running jobs, where each entry is a comma-separated pair of the job's `request_id` and the number of `parallel_engines` it is using.
+
+Example:
+
+```bash-and-response
+$ curl -i address.of.container:PORT/sessions
+HTTP/1.1 200 OK
+Server: BaseHTTP/0.6 Python/3.8.5
+Date: Mon, 08 Feb 2021 12:46:21 GMT
+Content-Type: application/json
+{
+  "request_ids": [
+    "978174b1564e40ccacba,2",
+    "52d532a2efcb4b78962b,2"
+  ]
+}
+```
+
+#### `/live`
+
+This endpoint provides a liveness probe. It can be queried using an HTTP GET request.
+
+This probe indicates whether all services in the Container are active.
+
+Possible responses:
+
+- `200` if all of the services in the Container have successfully started, and have recently sent an update to the Health Service.
+
+A JSON object is also returned in the body of the response, indicating the status.
+
+Example:
+
+```bash-and-response
+$ curl -i address.of.container:PORT/live
+HTTP/1.1 200 OK
+Server: BaseHTTP/0.6 Python/3.8.5
+Date: Mon, 08 Feb 2021 12:46:45 GMT
+Content-Type: application/json
+{
+  "live": true
+}
+```
+
+#### `/ready`
+
+This endpoint provides a readiness probe. It can be queried using an HTTP GET request.
+
+The container has been designed to process multiple jobs concurrently. This probe indicates whether the container has at least one slot (one engine) free for connections, and can be used as a scaling mechanism.
+
+Possible responses:
+
+- `200` if the container has a free connection slot.
+- `503` otherwise.
+
+The body of the response also contains a JSON object with the current status and the total number of engines being used.
+
+Example:
+
+```bash-and-response
+$ curl -i address.of.container:PORT/ready
+HTTP/1.1 200 OK
+Server: BaseHTTP/0.6 Python/3.8.5
+Date: Mon, 08 Feb 2021 12:47:05 GMT
+Content-Type: application/json
+{
+  "ready": true,
+  "engines_used": 2
+}
+```
+
+Environment variables:
+
+`SM_BATCH_WORKER_MAX_JOB_HISTORY` - The maximum number of job records to keep in memory.
 
 ## Realtime transcription
 
 The Realtime container provides the ability to transcribe speech data in a predefined language from a live stream or a recorded audio file.