minimax-speech-ts

    MiniMax TTS SDK for JavaScript / TypeScript

    Type-safe MiniMax TTS client for Node.js. Full API coverage — sync and streaming synthesis, voice cloning, voice design, and voice management — with a single runtime dependency. Ships ESM + CJS with complete TypeScript declarations. (Unofficial)

    • Full API coverage — sync, streaming (SSE), async, voice cloning, voice design, voice management
    • Zero config: npm install, pass your API key, get audio back
    • ReadableStream<Buffer> streaming — pipe directly to a file, HTTP response, or WebSocket
    • Typed error hierarchy: instanceof checks for auth, rate-limit, and validation errors
    • Client-side validation — catches bad params before the network round-trip
    • camelCase in, snake_case on the wire — no manual conversion needed
    • Dual output — ESM and CommonJS with .d.ts declarations
    1. Get an API key from platform.minimax.io
    2. npm install minimax-speech-ts
    3. Run:
    import { MiniMaxSpeech } from 'minimax-speech-ts'
    import fs from 'node:fs'

    const client = new MiniMaxSpeech({
      apiKey: process.env.MINIMAX_API_KEY!,
      groupId: process.env.MINIMAX_GROUP_ID, // optional
    })

    const result = await client.synthesize({
      text: 'Hello, world!',
      model: 'speech-02-hd',
      voiceSetting: { voiceId: 'English_expressive_narrator' },
    })

    await fs.promises.writeFile('output.mp3', result.audio) // → output.mp3

    // Stream audio to a file as chunks arrive
    const { audio } = await client.synthesizeStream({
      text: 'Stream me!',
      voiceSetting: { voiceId: 'English_expressive_narrator' },
      audioSetting: { format: 'mp3' },
    })

    const writer = fs.createWriteStream('output.mp3')
    for await (const chunk of audio) writer.write(chunk)
    writer.end()

    // Synthesize with an emotion
    const result = await client.synthesize({
      text: 'I am so happy to meet you!',
      voiceSetting: { voiceId: 'English_expressive_narrator', emotion: 'happy' },
    })

    // Clone a voice from an uploaded sample
    const file = new Blob([await fs.promises.readFile('sample.mp3')], { type: 'audio/mp3' })
    const upload = await client.uploadFile(file, 'voice_clone')
    await client.cloneVoice({ fileId: upload.file.fileId, voiceId: 'my-voice' })

    // Design a voice from a text description
    const voice = await client.designVoice({
      prompt: 'A warm female voice with a slight British accent',
      previewText: 'Hello, this is a preview.',
      voiceId: 'my-designed-voice',
    })

    Compared to calling the MiniMax API with raw fetch:

    • Automatic camelCase ↔ snake_case — write idiomatic JS, the SDK converts for the wire
    • Request validation — catches invalid params, emotion/model mismatches, and format conflicts before the network call
    • Typed errors: MiniMaxAuthError, MiniMaxRateLimitError, MiniMaxValidationError with statusCode and traceId
    • Streaming handled internally — SSE parsing and hex-to-Buffer decoding are built in
    • One dependency — only eventsource-parser for SSE; everything else is native Node.js
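    To make the camelCase-to-snake_case point above concrete, here is a minimal sketch of a recursive key conversion. It is illustrative only and is not taken from the SDK's source; the names camelToSnake and toWireFormat are hypothetical.

```typescript
// Illustrative only: not the SDK's internal implementation.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json }

// voiceId -> voice_id
function camelToSnake(key: string): string {
  return key.replace(/[A-Z]/g, (ch) => `_${ch.toLowerCase()}`)
}

// Recursively renames object keys; string values are left untouched.
function toWireFormat(value: Json): Json {
  if (Array.isArray(value)) return value.map(toWireFormat)
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {}
    for (const key of Object.keys(value)) {
      out[camelToSnake(key)] = toWireFormat(value[key])
    }
    return out
  }
  return value
}

const wire = toWireFormat({
  voiceSetting: { voiceId: 'English_expressive_narrator' },
  audioSetting: { sampleRate: 32000 },
})
console.log(JSON.stringify(wire))
// → {"voice_setting":{"voice_id":"English_expressive_narrator"},"audio_setting":{"sample_rate":32000}}
```

    With the SDK this conversion happens for you; the sketch only shows the shape of the mapping you would otherwise write by hand around raw fetch.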
    new MiniMaxSpeech({
      apiKey: string    // Required. MiniMax API key.
      groupId?: string  // Optional. MiniMax group ID, appended as ?GroupId= query param.
      apiHost?: string  // Optional. Defaults to 'https://api.minimax.io'.
                        // For reduced TTFA (time to first audio), try 'https://api-uw.minimax.io'.
    })
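    As a rough picture of how apiHost and groupId combine into a request URL, the sketch below uses the standard URL class. The helper name buildEndpoint and the path '/v1/t2a_v2' are assumptions made for illustration; the SDK chooses its own paths internally.

```typescript
// Hypothetical helper: shows the documented ?GroupId= behaviour, not SDK code.
function buildEndpoint(
  path: string,
  groupId?: string,
  apiHost = 'https://api.minimax.io',
): string {
  const url = new URL(path, apiHost)
  // The group ID is optional, so only add the query param when it is set.
  if (groupId) url.searchParams.set('GroupId', groupId)
  return url.toString()
}

console.log(buildEndpoint('/v1/t2a_v2', '12345'))
// → https://api.minimax.io/v1/t2a_v2?GroupId=12345
```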

    Synchronous text-to-speech. Returns decoded audio as a Buffer.

    const result = await client.synthesize({
      text: 'Hello!',
      model: 'speech-02-hd', // optional, defaults to 'speech-02-hd'
      voiceSetting: {
        voiceId: 'English_expressive_narrator',
        speed: 1.0,
        vol: 1.0,
        pitch: 0,
        emotion: 'happy', // speech-02-*/speech-2.6-*/speech-2.8-* only
      },
      audioSetting: {
        format: 'mp3', // 'mp3' | 'pcm' | 'flac' | 'wav' | 'pcmu_raw' | 'pcmu_wav' | 'opus'
        sampleRate: 32000,
        bitrate: 128000,
        channel: 1,
      },
      languageBoost: 'English',
      voiceModify: {
        pitch: 0, // -100 to 100
        intensity: 0, // -100 to 100
        timbre: 0, // -100 to 100
        soundEffects: 'robotic', // optional
      },
      timbreWeights: [ // mix multiple voices
        { voiceId: 'voice-1', weight: 0.5 },
        { voiceId: 'voice-2', weight: 0.5 },
      ],
      subtitleEnable: false,
      subtitleType: 'sentence', // 'sentence' | 'word' ('word_streaming' is streaming-only; use synthesizeStream)
      pronunciationDict: { tone: ['处理/(chǔ lǐ)'] },
    })

    result.audio // Buffer
    result.extraInfo // { audioLength, audioSampleRate, audioSize, bitrate, wordCount, usageCharacters, ... }
    result.traceId // string
    result.subtitleFile // string | undefined

    Pass outputFormat: 'url' to receive a URL string instead of a decoded buffer:

    const result = await client.synthesize({
      text: 'Hello!',
      outputFormat: 'url',
    })

    result.audio // string (URL)

    Streaming text-to-speech via SSE. Returns { audio, subtitle, extraInfo, traceId } — a ReadableStream<Buffer> of audio chunks plus three promises resolved from the final aggregated chunk: the subtitle file URL (string | undefined), the parsed extraInfo (ExtraInfo | undefined — audio length, size, billable characters, …), and the traceId (string | undefined) for MiniMax support.

    Drain audio first. subtitle, extraInfo, and traceId only settle once audio is being consumed (reading audio is what pumps the underlying SSE source). Awaiting them before reading or cancelling audio will hang. Use Promise.all([drainAudio, extraInfo]) if you need both concurrently. None of them ever reject — they resolve to undefined on early end, API error, transport error, or cancellation.

    streamOptions.excludeAggregatedAudio defaults to false, matching the MiniMax API: the final SSE chunk then repeats the whole clip, re-concatenated. The SDK never enqueues that aggregated audio onto the stream regardless of the flag, so extraInfo and traceId are unaffected by it. Pass { excludeAggregatedAudio: true } to skip the redundant retransmission and save bandwidth.

    WAV format is not supported in streaming mode.

    const { audio, subtitle, extraInfo, traceId } = await client.synthesizeStream({
      text: 'Hello, streaming world!',
      voiceSetting: { voiceId: 'English_expressive_narrator' },
      audioSetting: { format: 'mp3' },
      streamOptions: { excludeAggregatedAudio: true }, // optional; saves bandwidth
      subtitleEnable: true, // optional
      subtitleType: 'word_streaming', // 'word_streaming' is streaming-only
    })

    const writer = fs.createWriteStream('output.mp3')
    for await (const chunk of audio) {
      writer.write(chunk)
    }
    writer.end()

    const subtitleUrl = await subtitle // undefined unless subtitleEnable was set
    const info = await extraInfo // { audioLength, usageCharacters, … } or undefined
    const trace = await traceId // undefined if no final chunk arrived
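    The "drain audio first" rule can be tried without the SDK. The sketch below uses a stand-in, fakeStream (hypothetical, not part of minimax-speech-ts), whose extraInfo promise only settles once the stream has been read to the end, and then consumes both concurrently with Promise.all as suggested above.

```typescript
interface Info { audioLength: number }

// Stand-in for the documented return shape; NOT the real synthesizeStream().
function fakeStream(chunks: Uint8Array[]): {
  audio: ReadableStream<Uint8Array>
  extraInfo: Promise<Info | undefined>
} {
  let settle: (value: Info | undefined) => void = () => {}
  const extraInfo = new Promise<Info | undefined>((resolve) => { settle = resolve })
  let next = 0
  const audio = new ReadableStream<Uint8Array>({
    // pull() only runs while a consumer is reading, like the SSE pump described above.
    pull(controller) {
      if (next < chunks.length) {
        controller.enqueue(chunks[next++])
      } else {
        controller.close()
        settle({ audioLength: 1234 }) // made-up value for the demo
      }
    },
    cancel() { settle(undefined) }, // never rejects; resolves to undefined instead
  })
  return { audio, extraInfo }
}

// Reads the stream to the end and returns the number of bytes seen.
async function drain(stream: ReadableStream<Uint8Array>): Promise<number> {
  const reader = stream.getReader()
  let total = 0
  for (;;) {
    const result = await reader.read()
    if (result.done) return total
    total += result.value.byteLength
  }
}

const { audio, extraInfo } = fakeStream([new Uint8Array(3), new Uint8Array(5)])
// Start draining and awaiting together; awaiting extraInfo alone would never finish.
Promise.all([drain(audio), extraInfo]).then(([bytes, info]) => console.log(bytes, info))
```

    With the real client, replace fakeStream with the result of synthesizeStream and drain with whatever writes your chunks out.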

    Async text-to-speech for long-form content. Submit a task, then poll for completion.

    Provide either text or textFileId (mutually exclusive). WAV format is not supported.

    const task = await client.synthesizeAsync({
      text: 'A very long article...',
      voiceSetting: { voiceId: 'English_expressive_narrator' },
    })

    task.taskId // number
    task.fileId // number
    task.taskToken // string
    task.usageCharacters // number

    Poll the status of an async synthesis task. On success you get a fileId — use the MiniMax File API to retrieve the audio. The synthesized file is only available for 9 hours after success; retrieve and store it before then.

    const status = await client.querySynthesizeAsync(task.taskId)

    status.status // 'processing' | 'success' | 'failed' | 'expired'
    status.fileId // number (download file ID when status is 'success')
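    Polling is left to the caller. A minimal sketch of a polling loop follows; it assumes only the status and fileId fields shown above. The AsyncTaskClient interface and waitForTask helper are hypothetical names, and the default interval and timeout are arbitrary choices, not SDK defaults.

```typescript
type TaskStatus = 'processing' | 'success' | 'failed' | 'expired'

// Hypothetical structural type: anything with a matching querySynthesizeAsync works.
interface AsyncTaskClient {
  querySynthesizeAsync(taskId: number): Promise<{ status: TaskStatus; fileId?: number }>
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Polls until the task leaves 'processing' or the deadline passes; resolves to the fileId.
async function waitForTask(
  client: AsyncTaskClient,
  taskId: number,
  { intervalMs = 2000, timeoutMs = 10 * 60 * 1000 } = {},
): Promise<number> {
  const deadline = Date.now() + timeoutMs
  for (;;) {
    const { status, fileId } = await client.querySynthesizeAsync(taskId)
    if (status === 'success') {
      if (fileId === undefined) throw new Error('task succeeded without a fileId')
      return fileId
    }
    if (status !== 'processing') {
      throw new Error(`task ${taskId} ended with status "${status}"`)
    }
    if (Date.now() + intervalMs > deadline) throw new Error(`task ${taskId} timed out`)
    await sleep(intervalMs) // a 2 s interval stays far below the 10 QPS limit
  }
}
```

    Assuming the real client satisfies this shape, usage would look like waitForTask(client, task.taskId), followed by retrieveFileContent on the returned file ID within the 9-hour window.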

    Upload a file. purpose is one of voice_clone, prompt_audio (audio samples for voice cloning), or t2a_async_input (a text file feeding synthesizeAsync). Accepts a Blob or a ReadableStream<Uint8Array>.

    // Blob upload (buffered)
    const audioBlob = new Blob([await fs.promises.readFile('voice.mp3')], { type: 'audio/mp3' })
    const upload = await client.uploadFile(audioBlob, 'voice_clone')

    upload.file.fileId // number
    upload.file.bytes // number
    upload.file.filename // string

    For large files, pass a ReadableStream<Uint8Array> to upload without buffering the full payload in memory. The multipart body is assembled with per-chunk backpressure and cancellation propagation, so aborting the request cleanly releases the upstream source.

    import { Readable } from 'node:stream'
    import { createReadStream } from 'node:fs'

    const stream = Readable.toWeb(createReadStream('big-voice.wav')) as ReadableStream<Uint8Array>
    const upload = await client.uploadFile(stream, 'voice_clone', {
      filename: 'big-voice.wav',
      contentType: 'audio/wav', // optional, defaults to 'application/octet-stream'
    })

    List files filtered by purpose (voice_clone, prompt_audio, or t2a_async_input).

    const { files } = await client.listFiles({ purpose: 'voice_clone' })
    files[0].fileId // number
    files[0].filename // string
    files[0].bytes // number

    Retrieve metadata for a single file.

    const { file } = await client.retrieveFile(12345)
    file.bytes // number
    file.purpose // string
    file.createdAt // number — unix seconds

    Download the file bytes. Useful for fetching async-synthesis output once querySynthesizeAsync returns status: 'success'.

    const audio = await client.retrieveFileContent(task.fileId)
    await fs.promises.writeFile('output.mp3', audio)

    Delete a file. purpose accepts the upload purposes plus t2a_async (async synthesis output) and video_generation.

    await client.deleteFile({ fileId: 12345, purpose: 't2a_async' })
    

    Clone a voice from an uploaded audio file.

    const result = await client.cloneVoice({
      fileId: upload.file.fileId,
      voiceId: 'my-custom-voice', // 8-256 chars, must start with a letter
      text: 'Preview text', // optional preview
      model: 'speech-02-hd', // required if text is provided
      needNoiseReduction: true,
      needVolumeNormalization: true,
      clonePrompt: { // optional prompt-based cloning
        promptAudio: promptFileId, // fileId of a file uploaded with purpose 'prompt_audio'
        promptText: 'Transcript of the prompt audio',
      },
    })

    result.demoAudio // URL to preview audio (empty if no text provided)
    result.inputSensitive // { type: number } — 0 = normal; 1–7 categorize the safety trigger
    result.extraInfo // billing info (audioLength, usageCharacters, …) when text+model preview ran

    Design a new voice from a text description.

    const result = await client.designVoice({
      prompt: 'A warm female voice with a slight British accent',
      previewText: 'Hello, this is a preview of the designed voice.',
      voiceId: 'my-designed-voice', // optional, auto-generated if omitted
    })

    result.voiceId // string
    result.trialAudio // hex-encoded preview audio
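    Because trialAudio is a hex string rather than a Buffer, it has to be decoded before it can be played or saved. A small sketch using Node's Buffer follows; decodeHexAudio is a hypothetical helper name. Buffer.from(..., 'hex') silently stops at the first invalid character, so the sketch validates the input first.

```typescript
import { Buffer } from 'node:buffer'

// Decodes a hex-encoded audio payload into raw bytes.
function decodeHexAudio(hex: string): Buffer {
  // Reject odd lengths and non-hex characters instead of getting a truncated buffer.
  if (hex.length % 2 !== 0 || /[^0-9a-fA-F]/.test(hex)) {
    throw new Error('not a valid hex string')
  }
  return Buffer.from(hex, 'hex')
}

// '494433' is the ASCII text "ID3", the tag many MP3 files start with.
console.log(decodeHexAudio('494433').toString('latin1')) // → ID3
```

    The decoded bytes can then be written out with fs.promises.writeFile, as in the other examples.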

    List available voices.

    const voices = await client.getVoices({
      voiceType: 'all', // 'system' | 'voice_cloning' | 'voice_generation' | 'all'
    })

    voices.systemVoice // SystemVoiceInfo[] — built-in voices
    voices.voiceCloning // VoiceCloningInfo[] — your cloned voices
    voices.voiceGeneration // VoiceGenerationInfo[] — your designed voices

    Delete a cloned or designed voice.

    const result = await client.deleteVoice({
      voiceType: 'voice_cloning', // 'voice_cloning' | 'voice_generation'
      voiceId: 'my-custom-voice',
    })

    The library provides a typed error hierarchy:

    import {
      MiniMaxClientError, // Client-side validation (bad params, before request is sent)
      MiniMaxError, // Base class for all API errors
      MiniMaxAuthError, // Authentication failures (codes 1004, 2042, 2049)
      MiniMaxRateLimitError, // Rate limiting (codes 1002, 1039, 1041, 2045, 2056)
      MiniMaxValidationError, // Server-side validation (codes 1008, 1026, 1027, 1042, 1043, 1044, 2013, 2037, 2039, 2048, 20132)
    } from 'minimax-speech-ts'

    try {
      await client.synthesize({ text: 'Hello' })
    } catch (e) {
      if (e instanceof MiniMaxClientError) {
        // Bad parameters: fix your request
        console.error(e.message)
      } else if (e instanceof MiniMaxAuthError) {
        // Invalid API key
      } else if (e instanceof MiniMaxRateLimitError) {
        // Back off and retry
      } else if (e instanceof MiniMaxValidationError) {
        // Server rejected the request parameters
        console.error(e.statusCode, e.statusMsg, e.traceId)
      } else if (e instanceof MiniMaxError) {
        // Other API error
        console.error(e.statusCode, e.statusMsg)
      }
    }

    Client-side validation catches common mistakes before making a request:

    • Missing required fields (text, voiceId, etc.)
    • Emotions with unsupported models (speech-01-* doesn't support emotions)
    • fluent/whisper emotions with non-speech-2.6-* models
    • WAV format in streaming or async mode
    • text and textFileId both provided (mutually exclusive)
    • text provided without model in voice cloning

    Model               Emotions                          Notes
    speech-2.8-hd       All except fluent, whisper        Latest HD
    speech-2.8-turbo    All except fluent, whisper        Latest Turbo
    speech-2.6-hd       All including fluent, whisper
    speech-2.6-turbo    All including fluent, whisper
    speech-02-hd        All except fluent, whisper        Default
    speech-02-turbo     All except fluent, whisper
    speech-01-hd        None
    speech-01-turbo     None
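    The model/emotion table can be expressed as a small predicate. The sketch below mirrors the table only; it is not the SDK's validator, emotionAllowed is a hypothetical name, and it does not check that the emotion name itself is one the API knows.

```typescript
// Model families that accept an emotion at all, per the table above.
const EMOTION_FAMILIES = ['speech-02-', 'speech-2.6-', 'speech-2.8-']

function emotionAllowed(model: string, emotion: string): boolean {
  const supportsEmotion = EMOTION_FAMILIES.some((prefix) => model.startsWith(prefix))
  if (!supportsEmotion) return false // speech-01-* and unknown models
  // fluent and whisper are limited to the speech-2.6 family.
  if (emotion === 'fluent' || emotion === 'whisper') {
    return model.startsWith('speech-2.6-')
  }
  return true
}

console.log(emotionAllowed('speech-02-hd', 'happy'))      // → true
console.log(emotionAllowed('speech-2.8-hd', 'whisper'))   // → false
console.log(emotionAllowed('speech-2.6-turbo', 'fluent')) // → true
console.log(emotionAllowed('speech-01-hd', 'happy'))      // → false
```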

    The text field supports inline markup beyond plain content:

    • Pause control — insert <#x#> between text segments to pause for x seconds (range 0.01–99.99). Example: Hello<#0.5#>world.
    • Inline pronunciation — override the pronunciation of a word with Mandarin pinyin (tones 1–5), IPA, or Cantonese jyutping (tones 1–6), wrapped in half-width parentheses immediately after the word:
      • The word live is pronounced (lɪv) as a verb and (laɪv) as an adjective.
      • This is (he2)平, not (huo4)面.
      • 去街市買啲(sung3)。
    • Interjection tags (speech-2.8-hd / speech-2.8-turbo only) — embed natural speech sounds: (laughs), (chuckle), (coughs), (clear-throat), (groans), (breath), (pant), (inhale), (exhale), (gasps), (sniffs), (sighs), (snorts), (burps), (lip-smacking), (humming), (hissing), (emm), (sneezes).
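    Pause markers are easy to get wrong when built by hand. The sketch below joins text segments with a <#x#> marker and enforces the documented 0.01 to 99.99 range. joinWithPause is a hypothetical helper, and rounding to two decimals is an assumption based on that range.

```typescript
// Joins text segments with a pause marker of the form <#x#>.
function joinWithPause(segments: string[], seconds: number): string {
  if (!Number.isFinite(seconds) || seconds < 0.01 || seconds > 99.99) {
    throw new RangeError('pause must be between 0.01 and 99.99 seconds')
  }
  // Round to two decimals and drop trailing zeros: 0.5 -> "0.5", 2 -> "2".
  const marker = `<#${Number(seconds.toFixed(2))}#>`
  return segments.join(marker)
}

console.log(joinWithPause(['Hello', 'world'], 0.5)) // → Hello<#0.5#>world
```

    The result is an ordinary string that can be passed as the text field.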

    The API enforces these limits per account; the SDK surfaces 429-equivalent responses as MiniMaxRateLimitError. Build your own retry/backoff on top.

    Endpoint                                         Limit
    synthesize / synthesizeStream / voice cloning    60 RPM
    designVoice                                      20 RPM
    querySynthesizeAsync                             10 QPS
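
    Since retry and backoff are left to the caller, here is one possible shape for it: exponential backoff with full jitter. This is a sketch, not part of the SDK. withBackoff and its option names are hypothetical, and RateLimitError is a local stand-in so the block runs on its own; in real code the predicate would test e instanceof MiniMaxRateLimitError.

```typescript
// Local stand-in for MiniMaxRateLimitError so the sketch runs without the SDK.
class RateLimitError extends Error {}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Retries `call` while `shouldRetry` approves the error, up to `retries` extra attempts.
async function withBackoff<T>(
  call: () => Promise<T>,
  shouldRetry: (error: unknown) => boolean,
  { retries = 4, baseDelayMs = 1000, maxDelayMs = 30000 } = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call()
    } catch (error) {
      if (attempt >= retries || !shouldRetry(error)) throw error
      // Full jitter: wait a random time in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt)
      await sleep(Math.random() * cap)
    }
  }
}

// Demo: fails twice with the stand-in error, then succeeds on the third attempt.
let demoAttempts = 0
withBackoff(
  async () => {
    demoAttempts++
    if (demoAttempts < 3) throw new RateLimitError('slow down')
    return 'audio ready'
  },
  (error) => error instanceof RateLimitError,
  { baseDelayMs: 10 },
).then((value) => console.log(value, 'after', demoAttempts, 'attempts'))
```

    Errors the predicate rejects, such as validation or auth failures, are rethrown immediately, which matches the advice in the error-handling example to fix the request rather than retry it.
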
    • Voice-over generation — generate narration audio from scripts for videos and podcasts
    • Accessibility — add text-to-speech to web and Node.js applications
    • Voice cloning — clone a voice from a short audio sample and synthesize new speech
    • Voice design — create custom AI voices from text descriptions
    • Real-time TTS streaming — stream audio chunks via SSE for chatbots, virtual assistants, and live applications
    • Batch audio production — use async synthesis for long-form content like audiobooks and articles
    • Node.js >= 18 (uses native fetch and ReadableStream)
    • TypeScript >= 5.0
    • Works with any MiniMax API key from platform.minimax.io

    See CONTRIBUTING.md for development setup and guidelines.

    MIT