client.api_status.get()
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.api_status.get()
```
-
-
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-
client.auth.access_token(...)
-
-
-
Generates a new Access Token for the client. These tokens are short-lived and should be used to make requests to the API from authenticated clients.
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.auth.access_token(
    grants={"stt": True},
    expires_in=60,
)
```
-
-
-
grants: `typing.Optional[TokenGrantParams]` — The permissions to be granted via the token. Both TTS and STT grants are optional: specify only the capabilities you need.
-
expires_in: `typing.Optional[int]` — The number of seconds the token will be valid for since the time of generation. The maximum is 1 hour (3600 seconds).
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-
client.infill.bytes(...)
-
-
-
Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.
The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.
Infilling is only available on `sonic-2` at this time. At least one of `left_audio` or `right_audio` must be provided.
As with all generative models, there's some inherent variability, but here are some tips we recommend to get the best results from infill:
- Use longer infill transcripts
  - This gives the model more flexibility to adapt to the rest of the audio
- Target natural pauses in the audio when deciding where to clip
  - This means you don't need word-level timestamps to be as precise
- Clip right up to the start and end of the audio segment you want infilled, keeping as little silence in the left/right audio segments as possible
  - This helps the model generate more natural transitions
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.infill.bytes(
    model_id="sonic-2",
    language="en",
    transcript="middle segment",
    voice_id="694f9389-aac1-45b6-b726-9d9369183238",
    output_format_container="wav",
    output_format_sample_rate=44100,
    output_format_encoding="pcm_f32le",
    voice_experimental_controls_speed="slowest",
    voice_experimental_controls_emotion=["surprise:high", "curiosity:high"],
)
```
-
-
-
left_audio: `core.File` — See core.File for more documentation
-
right_audio: `core.File` — See core.File for more documentation
-
model_id: `str` — The ID of the model to use for generating audio
-
language: `str` — The language of the transcript
-
transcript: `str` — The infill text to generate
-
voice_id: `str` — The ID of the voice to use for generating audio
-
output_format_container: `OutputFormatContainer` — The format of the output audio
-
output_format_sample_rate: `int` — The sample rate of the output audio in Hz. Supported sample rates are 8000, 16000, 22050, 24000, 44100, 48000.
-
output_format_encoding: `typing.Optional[RawEncoding]` — Required for `raw` and `wav` containers.
-
output_format_bit_rate: `typing.Optional[int]` — Required for `mp3` containers.
-
voice_experimental_controls_speed: `typing.Optional[Speed]` — Either a number between -1.0 and 1.0 or a natural language description of speed. If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.
-
voice_experimental_controls_emotion: `typing.Optional[typing.List[Emotion]]` — An array of emotion:level tags. Supported emotions are: anger, positivity, surprise, sadness, and curiosity. Supported levels are: lowest, low, (omit), high, highest.
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration. You can pass in configuration such as `chunk_size` and more to customize the request and response.
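The emotion:level tag format accepted by `voice_experimental_controls_emotion` can be built with a small validating helper. This sketch is not part of the SDK; it only encodes the supported emotions and levels listed above:

```python
from typing import Optional

SUPPORTED_EMOTIONS = {"anger", "positivity", "surprise", "sadness", "curiosity"}
SUPPORTED_LEVELS = {"lowest", "low", "high", "highest"}  # omit the level for the default intensity

def emotion_tag(emotion: str, level: Optional[str] = None) -> str:
    """Build an "emotion:level" tag, validating against the documented values."""
    if emotion not in SUPPORTED_EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    if level is None:
        return emotion  # bare emotion name means default intensity
    if level not in SUPPORTED_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    return f"{emotion}:{level}"

# emotion_tag("surprise", "high") -> "surprise:high"
# emotion_tag("curiosity")        -> "curiosity"
```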
-
-
client.stt.transcribe(...)
-
-
-
Transcribes audio files into text using Cartesia's Speech-to-Text API.
Upload an audio file and receive a complete transcription response. Supports arbitrarily long audio files with automatic intelligent chunking for longer audio.
Supported audio formats: flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm
Response format: Returns JSON with transcribed text, duration, and language. Include `timestamp_granularities: ["word"]` to get word-level timestamps.
Pricing: Batch transcription is priced at 1 credit per 2 seconds of audio processed.
For migrating from the OpenAI SDK, see our [OpenAI Whisper to Cartesia Ink Migration Guide](/api-reference/stt/migrate-from-open-ai).
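As a sanity check on the batch pricing above (1 credit per 2 seconds of audio), the cost can be sketched as follows. The helper name is illustrative, and the round-up behavior for partial 2-second intervals is an assumption here:

```python
import math

def estimate_transcription_credits(audio_seconds: float) -> int:
    """Estimate batch STT cost at 1 credit per 2 seconds of audio.

    Assumes partial 2-second intervals round up; check your billing
    dashboard for the exact rounding behavior.
    """
    return math.ceil(audio_seconds / 2)

# a 90-second clip -> 45 credits; a 91-second clip -> 46 credits
```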
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.stt.transcribe(
    model="ink-whisper",
    language="en",
    timestamp_granularities=["word"],
)
```
-
-
-
file: `core.File` — See core.File for more documentation
-
model: `str` — ID of the model to use for transcription. Use `ink-whisper` for the latest Cartesia Whisper model.
-
encoding: `typing.Optional[SttEncoding]` — The encoding format to process the audio as. If not specified, the audio file will be decoded automatically. Supported formats:
- `pcm_s16le` — 16-bit signed integer PCM, little-endian (recommended for best performance)
- `pcm_s32le` — 32-bit signed integer PCM, little-endian
- `pcm_f16le` — 16-bit floating point PCM, little-endian
- `pcm_f32le` — 32-bit floating point PCM, little-endian
- `pcm_mulaw` — 8-bit μ-law encoded PCM
- `pcm_alaw` — 8-bit A-law encoded PCM
-
sample_rate: `typing.Optional[int]` — The sample rate of the audio in Hz.
-
language: `typing.Optional[str]` — The language of the input audio in ISO-639-1 format. Defaults to `en`. Supported languages:
- `en` (English) - `zh` (Chinese) - `de` (German) - `es` (Spanish) - `ru` (Russian) - `ko` (Korean) - `fr` (French) - `ja` (Japanese) - `pt` (Portuguese) - `tr` (Turkish) - `pl` (Polish) - `ca` (Catalan) - `nl` (Dutch) - `ar` (Arabic) - `sv` (Swedish) - `it` (Italian) - `id` (Indonesian) - `hi` (Hindi) - `fi` (Finnish) - `vi` (Vietnamese) - `he` (Hebrew) - `uk` (Ukrainian) - `el` (Greek) - `ms` (Malay) - `cs` (Czech) - `ro` (Romanian) - `da` (Danish) - `hu` (Hungarian) - `ta` (Tamil) - `no` (Norwegian) - `th` (Thai) - `ur` (Urdu) - `hr` (Croatian) - `bg` (Bulgarian) - `lt` (Lithuanian) - `la` (Latin) - `mi` (Maori) - `ml` (Malayalam) - `cy` (Welsh) - `sk` (Slovak) - `te` (Telugu) - `fa` (Persian) - `lv` (Latvian) - `bn` (Bengali) - `sr` (Serbian) - `az` (Azerbaijani) - `sl` (Slovenian) - `kn` (Kannada) - `et` (Estonian) - `mk` (Macedonian) - `br` (Breton) - `eu` (Basque) - `is` (Icelandic) - `hy` (Armenian) - `ne` (Nepali) - `mn` (Mongolian) - `bs` (Bosnian) - `kk` (Kazakh) - `sq` (Albanian) - `sw` (Swahili) - `gl` (Galician) - `mr` (Marathi) - `pa` (Punjabi) - `si` (Sinhala) - `km` (Khmer) - `sn` (Shona) - `yo` (Yoruba) - `so` (Somali) - `af` (Afrikaans) - `oc` (Occitan) - `ka` (Georgian) - `be` (Belarusian) - `tg` (Tajik) - `sd` (Sindhi) - `gu` (Gujarati) - `am` (Amharic) - `yi` (Yiddish) - `lo` (Lao) - `uz` (Uzbek) - `fo` (Faroese) - `ht` (Haitian Creole) - `ps` (Pashto) - `tk` (Turkmen) - `nn` (Nynorsk) - `mt` (Maltese) - `sa` (Sanskrit) - `lb` (Luxembourgish) - `my` (Myanmar) - `bo` (Tibetan) - `tl` (Tagalog) - `mg` (Malagasy) - `as` (Assamese) - `tt` (Tatar) - `haw` (Hawaiian) - `ln` (Lingala) - `ha` (Hausa) - `ba` (Bashkir) - `jw` (Javanese) - `su` (Sundanese) - `yue` (Cantonese)
-
timestamp_granularities: `typing.Optional[typing.List[TimestampGranularity]]` — The timestamp granularities to populate for this transcription. Currently only `word`-level timestamps are supported.
-
request_options:
typing.Optional[RequestOptions]— Request-specific configuration.
-
-
client.tts.bytes(...)
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.tts.bytes(
    model_id="sonic-2",
    transcript="Hello, world!",
    voice={"mode": "id", "id": "694f9389-aac1-45b6-b726-9d9369183238"},
    language="en",
    output_format={
        "sample_rate": 44100,
        "encoding": "pcm_f32le",
        "container": "raw",
    },
)
```
-
-
-
model_id: `str` — The ID of the model to use for the generation. See Models for available models.
-
transcript: `str`
-
voice: `TtsRequestVoiceSpecifierParams`
-
output_format: `OutputFormatParams`
-
language: `typing.Optional[SupportedLanguage]`
-
generation_config: `typing.Optional[GenerationConfigParams]`
-
duration: `typing.Optional[float]` — The maximum duration of the audio in seconds. You do not usually need to specify this. If the duration is not appropriate for the length of the transcript, the output audio may be truncated.
-
speed: `typing.Optional[ModelSpeed]`
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration. You can pass in configuration such as `chunk_size` and more to customize the request and response.
-
-
client.tts.sse(...)
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
response = client.tts.sse(
    model_id="sonic-2",
    transcript="Hello, world!",
    voice={"mode": "id", "id": "694f9389-aac1-45b6-b726-9d9369183238"},
    language="en",
    output_format={
        "container": "raw",
        "sample_rate": 44100,
        "encoding": "pcm_f32le",
    },
    add_timestamps=True,
)
for chunk in response:
    print(chunk)  # handle each streamed chunk
```
-
-
-
model_id: `str` — The ID of the model to use for the generation. See Models for available models.
-
transcript: `str`
-
voice: `TtsRequestVoiceSpecifierParams`
-
output_format: `SseOutputFormatParams`
-
language: `typing.Optional[SupportedLanguage]`
-
generation_config: `typing.Optional[GenerationConfigParams]`
-
duration: `typing.Optional[float]` — The maximum duration of the audio in seconds. You do not usually need to specify this. If the duration is not appropriate for the length of the transcript, the output audio may be truncated.
-
speed: `typing.Optional[ModelSpeed]`
-
add_timestamps: `typing.Optional[bool]` — Whether to return word-level timestamps. If `false` (default), no word timestamps will be produced at all. If `true`, the server will return timestamp events containing word-level timing information.
-
add_phoneme_timestamps: `typing.Optional[bool]` — Whether to return phoneme-level timestamps. If `false` (default), no phoneme timestamps will be produced; if `add_timestamps` is `true`, the produced timestamps will be word timestamps instead. If `true`, the server will return timestamp events containing phoneme-level timing information.
-
use_normalized_timestamps: `typing.Optional[bool]` — Whether to use normalized timestamps (`true`) or original timestamps (`false`).
-
context_id: `typing.Optional[ContextId]` — Optional context ID for this request.
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-
client.voice_changer.bytes(...)
-
-
-
Takes an audio file of speech and returns an audio file of speech spoken with the same intonation, but with a different voice.
This endpoint is priced at 15 characters per second of input audio.
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.voice_changer.bytes(
    voice_id="694f9389-aac1-45b6-b726-9d9369183238",
    output_format_container="raw",
    output_format_sample_rate=44100,
    output_format_encoding="pcm_f32le",
)
```
-
-
-
clip: `core.File` — See core.File for more documentation
-
voice_id: `str`
-
output_format_container: `OutputFormatContainer`
-
output_format_sample_rate: `int` — The sample rate of the output audio in Hz. Supported sample rates are 8000, 16000, 22050, 24000, 44100, 48000.
-
output_format_encoding: `typing.Optional[RawEncoding]` — Required for `raw` and `wav` containers.
-
output_format_bit_rate: `typing.Optional[int]` — Required for `mp3` containers.
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration. You can pass in configuration such as `chunk_size` and more to customize the request and response.
-
-
client.voice_changer.sse(...)
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
response = client.voice_changer.sse(
    voice_id="694f9389-aac1-45b6-b726-9d9369183238",
    output_format_container="raw",
    output_format_sample_rate=44100,
    output_format_encoding="pcm_f32le",
)
for chunk in response:
    print(chunk)  # handle each streamed chunk
```
-
-
-
clip: `core.File` — See core.File for more documentation
-
voice_id: `str`
-
output_format_container: `OutputFormatContainer`
-
output_format_sample_rate: `int`
-
output_format_encoding: `typing.Optional[RawEncoding]` — Required for `raw` and `wav` containers.
-
output_format_bit_rate: `typing.Optional[int]` — Required for `mp3` containers.
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-
client.voices.list(...)
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
response = client.voices.list()
for item in response:
    print(item)

# alternatively, you can paginate page-by-page
for page in response.iter_pages():
    print(page)
```
-
-
-
limit: `typing.Optional[int]` — The number of Voices to return per page, ranging between 1 and 100.
-
starting_after: `typing.Optional[str]` — A cursor to use in pagination. `starting_after` is a Voice ID that defines your place in the list. For example, if you make a /voices request and receive 100 objects, ending with `voice_abc123`, your subsequent call can include `starting_after=voice_abc123` to fetch the next page of the list.
-
ending_before: `typing.Optional[str]` — A cursor to use in pagination. `ending_before` is a Voice ID that defines your place in the list. For example, if you make a /voices request and receive 100 objects, starting with `voice_abc123`, your subsequent call can include `ending_before=voice_abc123` to fetch the previous page of the list.
-
is_owner: `typing.Optional[bool]` — Whether to only return voices owned by the current user.
-
is_starred: `typing.Optional[bool]` — Whether to only return starred voices.
-
gender: `typing.Optional[GenderPresentation]` — The gender presentation of the voices to return.
-
expand: `typing.Optional[typing.Sequence[VoiceExpandOptions]]` — Additional fields to include in the response.
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
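The `starting_after` cursor semantics described above can be sketched with a stubbed fetch function. `fetch_page` below stands in for a /voices request and is not part of the SDK (the built-in pagination shown in the example already handles this for you):

```python
def fetch_page(voices, starting_after=None, limit=3):
    """Stub for a /voices request: returns up to `limit` items after the cursor."""
    start = 0
    if starting_after is not None:
        start = next(i for i, v in enumerate(voices) if v["id"] == starting_after) + 1
    return voices[start:start + limit]

def list_all(voices, limit=3):
    """Walk every page by passing the last Voice ID of each page as starting_after."""
    out, cursor = [], None
    while True:
        page = fetch_page(voices, starting_after=cursor, limit=limit)
        if not page:
            break
        out.extend(page)
        cursor = page[-1]["id"]  # the cursor is the last Voice ID seen
    return out

voices = [{"id": f"voice_{i}"} for i in range(7)]
all_ids = [v["id"] for v in list_all(voices)]
# all_ids -> ["voice_0", ..., "voice_6"], gathered in pages of 3, 3, and 1
```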
-
-
client.voices.clone(...)
-
-
-
Clone a voice from an audio clip. This endpoint has two modes: stability and similarity.
Similarity mode clones are more similar to the source clip, but may reproduce background noise. For these, use an audio clip about 5 seconds long.
Stability mode clones are more stable, but may not sound as similar to the source clip. For these, use an audio clip 10-20 seconds long.
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.voices.clone(
    name="A high-similarity cloned voice",
    description="Copied from Cartesia docs",
    mode="similarity",
    language="en",
)
```
-
-
-
clip: `core.File` — See core.File for more documentation
-
name: `str` — The name of the voice.
-
language: `SupportedLanguage` — The language of the voice.
-
mode: `CloneMode` — Tradeoff between similarity and stability. Similarity clones sound more like the source clip, but may reproduce background noise. Stability clones always sound like a studio recording, but may not sound as similar to the source clip.
-
description: `typing.Optional[str]` — A description for the voice.
-
enhance: `typing.Optional[bool]` — Whether to apply AI enhancements to the clip to reduce background noise. This leads to cleaner generated speech at the cost of reduced similarity to the source clip.
-
base_voice_id: `typing.Optional[VoiceId]` — Optional base voice ID that the cloned voice is derived from.
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-
client.voices.delete(...)
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.voices.delete(
    id="id",
)
```
-
-
-
id: `VoiceId`
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-
client.voices.update(...)
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.voices.update(
    id="id",
    name="name",
    description="description",
)
```
-
-
-
id: `VoiceId`
-
name: `str` — The name of the voice.
-
description: `str` — The description of the voice.
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-
client.voices.get(...)
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.voices.get(
    id="id",
)
```
-
-
-
id: `VoiceId`
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-
client.voices.localize(...)
-
-
-
Create a new voice from an existing voice localized to a new language and dialect.
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.voices.localize(
    voice_id="694f9389-aac1-45b6-b726-9d9369183238",
    name="Sarah Peninsular Spanish",
    description="Sarah Voice in Peninsular Spanish",
    language="es",
    original_speaker_gender="female",
    dialect="pe",
)
```
-
-
-
voice_id: `str` — The ID of the voice to localize.
-
name: `str` — The name of the new localized voice.
-
description: `str` — The description of the new localized voice.
-
language: `LocalizeTargetLanguage`
-
original_speaker_gender: `Gender`
-
dialect: `typing.Optional[LocalizeDialectParams]`
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-
client.voices.mix(...)
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.voices.mix(
    voices=[{"id": "id", "weight": 1.1}, {"id": "id", "weight": 1.1}],
)
```
-
-
-
voices: `typing.Sequence[MixVoiceSpecifierParams]`
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-
client.voices.create(...)
-
-
-
Create voice from raw features. If you'd like to clone a voice from an audio file, please use Clone Voice instead.
-
-
-
```python
from cartesia import Cartesia

client = Cartesia(
    api_key="YOUR_API_KEY",
)
client.voices.create(
    name="name",
    description="description",
    embedding=[1.1, 1.1],
)
```
-
-
-
name: `str` — The name of the voice.
-
description: `str` — The description of the voice.
-
embedding: `Embedding`
-
language: `typing.Optional[SupportedLanguage]`
-
base_voice_id: `typing.Optional[BaseVoiceId]`
-
request_options: `typing.Optional[RequestOptions]` — Request-specific configuration.
-
-