Build Voice AI with the ElevenLabs API in Python: A 60-Minute Tutorial

Step-by-step ElevenLabs Python tutorial: signup, voice picker, Turbo and Eleven v3, low-latency streaming, Hindi and Tamil. Around 30 lines of code.

4 May 2026 | Updated 19 May 2026 | ~10 min read

Image: ElevenLabs product home, used for editorial coverage of the framework taught in this tutorial.

What you’ll need

A Python 3.9+ install, an internet connection, and a free account at elevenlabs.io (no credit card required). The ElevenLabs free tier covers 10,000 credits per month per the public pricing page¹. One character costs 1 credit on Multilingual v2 or Eleven v3 and 0.5 credits on Turbo or Flash, which works out to roughly 10,000 multilingual characters or 20,000 Turbo characters before the cap. That is enough for tutorial work, prototype demos, and a few rounds of voice tweaking. Pricing changes; verify current free-tier limits on the ElevenLabs pricing page before committing your build to a credit assumption. You will not need a paid plan to finish this tutorial.
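To make the credit arithmetic concrete, here is a minimal sketch of a budget estimator. The rates are the ones quoted above, not values read from the API, and the helper names (estimate_credits, max_characters) are hypothetical. It also assumes one Python character equals one billed character, which may not hold exactly for every script.

```python
# Credits-per-character rates as quoted in this article (assumption: still
# current; check the pricing page before relying on them).
CREDITS_PER_CHAR = {
    "eleven_multilingual_v2": 1.0,
    "eleven_v3": 1.0,
    "eleven_turbo_v2_5": 0.5,
}
FREE_TIER_CREDITS = 10_000


def estimate_credits(text: str, model_id: str) -> float:
    """Estimate the credits one generation of `text` would consume."""
    return len(text) * CREDITS_PER_CHAR[model_id]


def max_characters(model_id: str, budget: float = FREE_TIER_CREDITS) -> int:
    """Number of characters that `budget` credits cover on `model_id`."""
    return int(budget / CREDITS_PER_CHAR[model_id])


if __name__ == "__main__":
    print(estimate_credits("Hello world.", "eleven_turbo_v2_5"))
    print(max_characters("eleven_v3"))
```

Running an estimate like this before each batch is a cheap way to notice when a script is about to exceed the monthly cap.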

The build target: a Python script that takes a string of text, picks an ElevenLabs voice, generates an MP3, and (optionally) streams the audio chunk by chunk for low-latency apps. The full pipeline is around 30 lines of Python, and working through it takes roughly 60 minutes including signup. Total scope: install the SDK, list voices, generate audio with the Turbo model, save the file, stream the audio, and add a multilingual variant for Hindi or Tamil text.

A note on the aggregator framing: this tutorial restates the steps documented in the ElevenLabs developer docs² and the official Python SDK README³. The code below is reproduced from those sources for editorial coverage, with regional framing on cost, language coverage, and pitfalls.

Step 1: Install the SDK and grab an API key

Install the official elevenlabs package from PyPI. The package is maintained by ElevenLabs themselves and is the supported path for Python integration³.

pip install elevenlabs

Then sign up at elevenlabs.io and visit the profile page in the dashboard. The API key is in the “API Keys” section. Set it as an environment variable so it never lands in version control:

export ELEVENLABS_API_KEY="your_key_here"

On Windows PowerShell, the equivalent is $env:ELEVENLABS_API_KEY = "your_key_here". The Python client reads this variable automatically when you instantiate it.
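A small guard can turn a missing variable into a readable error before any request is made. The sketch below uses only the standard library; load_api_key is a hypothetical helper name, not part of the SDK.

```python
import os


def load_api_key(env=os.environ, name="ELEVENLABS_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before running."
        )
    return key
```

You could then pass the result explicitly, for example ElevenLabs(api_key=load_api_key()), so a misconfigured shell fails fast instead of surfacing later as an authentication error.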

Image: ElevenLabs developer documentation, used for editorial coverage of the SDK and API key flow.

Step 2: List the available voices

ElevenLabs ships a prebuilt voice library plus user-cloned voices on paid plans. For the free tier, the prebuilt voices are the practical option. The SDK exposes them via a simple list call:

import os
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

# Fetch the voices visible to this account and print the first ten
voices = client.voices.get_all()
for voice in voices.voices[:10]:
    print(f"{voice.name:20} {voice.voice_id}")

Run this once and pick a voice_id you like. Common picks for English narration are “Rachel” and “Adam”; the SDK README and the dashboard both surface these as defaults³. Save the chosen voice_id as a constant in your script.
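If you would rather look a voice up by name than copy the ID by hand, a helper along these lines works on any list of objects that expose the name and voice_id attributes printed in Step 2. It is a sketch: find_voice_id is a hypothetical name, and the demo uses stand-in objects rather than a live API response.

```python
from types import SimpleNamespace


def find_voice_id(voices, name):
    """Return the voice_id of the first voice whose name matches, ignoring case."""
    wanted = name.strip().lower()
    for voice in voices:
        if voice.name.strip().lower() == wanted:
            return voice.voice_id
    raise LookupError(f"No voice named {name!r} in this list")


if __name__ == "__main__":
    # Stand-in objects; in the tutorial you would pass voices.voices instead.
    demo = [SimpleNamespace(name="Rachel", voice_id="21m00Tcm4TlvDq8ikWAM")]
    print(find_voice_id(demo, "rachel"))
```

Raising LookupError on a miss is a deliberate choice here: a silent None would only fail later, inside the generation call.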

Step 3: Generate audio with the Turbo model

ElevenLabs offers several models per the docs², with Turbo positioned as the low-latency option and Eleven v3 (or its Multilingual v2 predecessor) positioned as the higher-fidelity option supporting more languages. For most app workloads where speed matters, Turbo is the right starting point.

import os
from elevenlabs.client import ElevenLabs
from elevenlabs import play

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

# Request speech from the low-latency Turbo model
audio = client.text_to_speech.convert(
    voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel; replace with yours
    model_id="eleven_turbo_v2_5",
    text="Hello world. This is a test of the ElevenLabs Python SDK.",
    output_format="mp3_44100_128",
)

# Play the result on the local machine
play(audio)

The convert call returns a generator of audio bytes. The play helper plays them locally by shelling out to ffplay, so ffmpeg must be installed and on the system path. If play errors out, skip it and write the bytes to a file instead, which we’ll do next.
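The fallback described above can be sketched as a single function. This assumes local playback depends on the ffplay binary that ships with ffmpeg; play_or_save is a hypothetical helper, and the player argument stands in for the SDK's play function so the sketch stays independent of the SDK.

```python
import shutil


def play_or_save(chunks, player, path="output.mp3", which=shutil.which):
    """Play via `player` if ffplay is on PATH, otherwise write chunks to `path`.

    Returns "played" or the path that was written.
    """
    if which("ffplay") is not None:
        player(chunks)
        return "played"
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
    return path
```

In the tutorial's terms, the call would look like play_or_save(audio, play), which keeps the same script usable on a laptop and on a headless server.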

Image: ElevenLabs API reference — text-to-speech, used for editorial coverage of the model lineup discussed in this step.

Step 4: Save the audio to a file

For most production paths, you want the audio on disk or in object storage rather than played live. The convert call returns chunks; concatenate and write:

# Reuses the client created in Step 3
audio = client.text_to_speech.convert(
    voice_id="21m00Tcm4TlvDq8ikWAM",
    model_id="eleven_turbo_v2_5",
    text="Hello world. This is a test of the ElevenLabs Python SDK.",
    output_format="mp3_44100_128",
)

# Write the chunks to disk in arrival order, skipping empty ones
with open("output.mp3", "wb") as f:
    for chunk in audio:
        if chunk:
            f.write(chunk)

print("Saved output.mp3")

The output_format string controls codec and bitrate. mp3_44100_128 gives 128 kbps MP3 at 44.1 kHz, which is fine for most app playback. The docs² list other formats including PCM 16-bit at various sample rates and ulaw 8000 (useful for telephony pipelines).
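The format strings follow a codec_samplerate_bitrate pattern, with the bitrate omitted for uncompressed formats. A rough parser, useful for logging or for choosing a file extension, could look like this. AudioFormat and parse_output_format are hypothetical names, and the sketch does not check that a given combination is actually offered by the API.

```python
from typing import NamedTuple, Optional


class AudioFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]


def parse_output_format(value: str) -> AudioFormat:
    """Split a string such as 'mp3_44100_128' into its parts."""
    parts = value.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"Unexpected output_format: {value!r}")
    codec = parts[0]
    sample_rate = int(parts[1])
    bitrate = int(parts[2]) if len(parts) == 3 else None
    return AudioFormat(codec, sample_rate, bitrate)


if __name__ == "__main__":
    print(parse_output_format("mp3_44100_128"))
    print(parse_output_format("ulaw_8000"))
```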

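Step 5: Stream audio as it is generated

For interactive apps, latency matters more than file fidelity. The streaming API returns audio bytes as the model generates them, so playback can start before the full sentence finishes. One caveat on terminology: the SDK call used in this step talks to the HTTP streaming endpoint documented in the official API reference⁴, which delivers a chunked response rather than a WebSocket connection. ElevenLabs also offers a WebSocket interface for sending text incrementally; it is a separate endpoint and is not needed for this tutorial.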

import os
from elevenlabs.client import ElevenLabs
from elevenlabs import stream

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

# Method name as given in this tutorial; some SDK releases expose the same
# call as client.text_to_speech.stream(...). Check your installed version.
audio_stream = client.text_to_speech.convert_as_stream(
    voice_id="21m00Tcm4TlvDq8ikWAM",
    model_id="eleven_turbo_v2_5",
    text=(
        "This sentence streams chunk by chunk. "
        "Playback starts before the whole sentence finishes generating."
    ),
)

# Play chunks as they arrive
stream(audio_stream)

The stream helper plays the audio as it arrives; it relies on mpv being installed and on the system path. For backend services serving audio to a frontend, you’d pipe audio_stream into your HTTP response or a WebSocket frame instead of calling stream(). The pattern is the same: iterate the generator, forward bytes downstream.
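The "iterate the generator, forward bytes downstream" pattern can be sketched without any web framework. Here sink is anything with a write method (a file, a socket wrapper, a response body), and forward_chunks is a hypothetical helper name.

```python
import io


def forward_chunks(chunks, sink):
    """Write each non-empty chunk to `sink` as it arrives; return bytes forwarded."""
    total = 0
    for chunk in chunks:
        if not chunk:
            continue
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # Stand-in generator; in the tutorial this would be audio_stream.
    fake_stream = (piece for piece in [b"ID3", b"", b"\x00\x01"])
    buffer = io.BytesIO()
    print(forward_chunks(fake_stream, buffer))
```

Because the function never collects the chunks into a list, the first bytes reach the sink as soon as the API yields them, which is the whole point of streaming.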

Image: ElevenLabs API reference — streaming, used for editorial coverage of the WebSocket streaming endpoint demonstrated in this step.

Step 6: Multilingual generation for Hindi, Tamil, Bengali, Urdu, and beyond

ElevenLabs offers two multilingual models per the docs²: Multilingual v2 (29 languages, stable since 2024) and Eleven v3, which entered general availability in February 2026 and expanded coverage to 70+ languages with improved emotion and pace controls⁵. New accounts created after February 2026 default to v3; v2 remains available as a legacy option.

A scope note for readers building voice content in Indian languages: per the ElevenLabs supported-languages list, Multilingual v2 supports Hindi and Tamil but does NOT cover Urdu, Bengali, Telugu, Kannada, Malayalam, Marathi, or Gujarati. Eleven v3 expands the supported set to 70+ languages including Urdu, Bengali, Telugu, Kannada, Malayalam, Marathi, and Gujarati. If your app targets one of these languages and your account is on v2 only, AI4Bharat’s IndicTTS⁶ and Sarvam Bulbul are alternatives; otherwise upgrade to v3 for the full Indic-language coverage.

For supported languages, the model picks language automatically from input text. For best fidelity, pick a voice that lists the target language as supported in the dashboard’s voice library; English-only voices may produce degraded pronunciation on non-English scripts.

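# Reuses the client from the earlier steps. For best results, replace the
# voice_id with a voice that lists Hindi as a supported language.
audio = client.text_to_speech.convert(
    voice_id="21m00Tcm4TlvDq8ikWAM",
    model_id="eleven_v3",
    text="नमस्ते, यह ElevenLabs का परीक्षण है।",
    output_format="mp3_44100_128",
)

# Write the Hindi audio to disk chunk by chunk
with open("output_hindi.mp3", "wb") as f:
    for chunk in audio:
        if chunk:
            f.write(chunk)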

If your account predates February 2026 or you want the older model for parity reasons, swap eleven_v3 for eleven_multilingual_v2. The API surface is identical, but v2 covers fewer languages (no Urdu, Bengali, Telugu, Kannada, Malayalam, Marathi, or Gujarati).
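The model choice described in this step can be written down as a small lookup. The language sets below contain only the Indic languages named in this article (as ISO 639-1 codes), not the full supported-language lists, and pick_model is a hypothetical helper.

```python
# Indic languages this article says Multilingual v2 covers
V2_AND_V3 = {"hi", "ta"}  # Hindi, Tamil
# Indic languages this article says need Eleven v3
V3_ONLY = {"ur", "bn", "te", "kn", "ml", "mr", "gu"}


def pick_model(language_code: str, v3_available: bool = True) -> str:
    """Choose a model_id for an Indic language, per the coverage stated above."""
    code = language_code.strip().lower()
    if code in V3_ONLY:
        if not v3_available:
            raise ValueError(f"{code!r} needs eleven_v3, which is not available")
        return "eleven_v3"
    if code in V2_AND_V3:
        return "eleven_v3" if v3_available else "eleven_multilingual_v2"
    raise ValueError(f"{code!r} is outside the languages covered by this sketch")


if __name__ == "__main__":
    print(pick_model("hi", v3_available=False))
    print(pick_model("bn"))
```

An explicit error for unsupported combinations is a reasonable point at which to fall back to one of the alternative engines mentioned earlier.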

A practical note on cost per the pricing page¹: Multilingual v2 and Eleven v3 cost 1 credit per character; Turbo and Flash cost 0.5 credits per character. The 10,000-credit free-tier cap covers roughly 20,000 Turbo characters or 10,000 multilingual characters per month. Start short, validate pronunciation, then scale up.

For apps targeting voice content in any supported language, the practical workflow is: draft scripts in the target language, run a short sample through Eleven v3, listen to the output, fix any phonetic-spelling issues for proper nouns, and only then generate the full audio. A bad full-length generation that has to be re-run is charged against the same credit quota a second time; a 50-character sample costs about 50 credits on Eleven v3, which is negligible by comparison.
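One way to produce that short sample is to cut the script at a word boundary under a character limit. This is a sketch with a hypothetical helper name (make_sample); it splits on spaces only, which suits Hindi and Tamil text but not scripts written without spaces.

```python
def make_sample(text: str, limit: int = 50) -> str:
    """Return a prefix of `text` no longer than `limit`, cut at a word boundary if possible."""
    text = text.strip()
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the character right after the cut is whitespace, the cut is already clean.
    if text[limit].isspace():
        return cut.rstrip()
    head, sep, _ = cut.rpartition(" ")
    # Fall back to a hard cut when the prefix contains no space at all.
    return head.rstrip() if sep else cut


if __name__ == "__main__":
    script = "नमस्ते, यह ElevenLabs का परीक्षण है। यह दूसरा वाक्य है।"
    print(make_sample(script, limit=30))
```

Feeding the sample rather than the full script to the convert call keeps each pronunciation check to a few dozen credits.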

Common pitfalls

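The voice_id strings look opaque and are easy to typo. Copy them from the dashboard or from the get_all list rather than retyping. A 401 response means the request reached ElevenLabs without a valid key. With the os.environ lookup used in this tutorial, a missing variable fails earlier with a KeyError, so a 401 more often points to an empty, truncated, or revoked key; verify with echo $ELEVENLABS_API_KEY (or $env:ELEVENLABS_API_KEY on PowerShell) before assuming the key is bad.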

The play helper depends on system audio tools: it shells out to ffplay, which ships with ffmpeg, and the stream helper needs mpv. On a Linux server with no audio device, play will throw; switch to writing to disk. On macOS and Windows, install ffmpeg (and mpv for streaming playback) and make sure the binaries are on PATH if the helpers error.

Free-tier credits reset monthly per the pricing page¹. If a script burns through the cap mid-development, the API rejects further requests until the reset or until you upgrade. Check the status code and error body to tell an exhausted quota apart from an ordinary rate limit (HTTP 429). Cache generated audio aggressively during development so you’re not re-generating the same sentence every iteration.
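A disk cache keyed on the model, voice, and text is one simple way to avoid paying twice for the same sentence. In this sketch, cached_audio and the tts_cache directory are hypothetical, and synthesize is any callable you supply that returns an iterable of byte chunks (for example, a thin wrapper around the convert call from Step 4).

```python
import hashlib
from pathlib import Path


def cached_audio(text, voice_id, model_id, synthesize, cache_dir="tts_cache"):
    """Return audio bytes, calling `synthesize` only on a cache miss."""
    raw_key = f"{model_id}\n{voice_id}\n{text}".encode("utf-8")
    path = Path(cache_dir) / f"{hashlib.sha256(raw_key).hexdigest()}.mp3"
    if path.exists():
        return path.read_bytes()
    # Join the streamed chunks, skipping empty ones, then persist the result.
    audio = b"".join(chunk for chunk in synthesize(text, voice_id, model_id) if chunk)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

The key deliberately includes the model and voice, so switching either one produces a fresh generation instead of a stale file.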

Pronunciation of brand names, acronyms, and Hindi or Tamil proper nouns is approximate, not perfect. Eleven v3 handles common Hindi and Tamil words well; less-common proper nouns may need phonetic spelling in input text to land correctly.

Latency to ElevenLabs varies by geography (the docs do not publish region-specific first-byte numbers) and by ISP / time of day; expect longer first-byte times from India / SEA / Australia than from US / EU. For real-time voice apps, do not rely on one-shot blocking calls; use the streaming endpoint (Step 5) and start playback on the first chunk so the perceived latency is time-to-first-audio, not time-to-full-audio. The streaming pattern also reduces the chance of a timeout interrupting a long generation, which matters for telephony or IVR pipelines.
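Since region-specific numbers are not published, it can help to measure time-to-first-audio from your own network. The sketch below times any iterable of byte chunks; measure_stream is a hypothetical helper, and it assumes the iterable is lazy, meaning the request is not sent until iteration begins, so the first-chunk time includes the network round trip.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume `chunks` and report time to first chunk, total time, and byte count."""
    start = clock()
    first_chunk_s = None
    total_bytes = 0
    for chunk in chunks:
        if not chunk:
            continue
        if first_chunk_s is None:
            first_chunk_s = clock() - start
        total_bytes += len(chunk)
    return {
        "first_chunk_s": first_chunk_s,  # None if the stream was empty
        "total_s": clock() - start,
        "bytes": total_bytes,
    }
```

Comparing first_chunk_s with total_s on a few representative sentences gives a rough picture of how much perceived latency streaming saves for your users.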

Where to go next

Image: elevenlabs/elevenlabs-python on GitHub, used for editorial coverage of the SDK’s roadmap and conversational-AI primitives referenced here.

The natural next steps after this tutorial: voice cloning (paid tier), conversational agents (the ElevenLabs Conversational AI product, which adds turn-taking and barge-in on top of the TTS primitives), and integration with an LLM for full text-to-voice agent loops.

A legal note before you clone any voice: ElevenLabs’ Terms of Service require explicit consent from the person whose voice you are cloning⁷. Cloning a celebrity, public figure, coworker, or any third party without their documented written permission violates the ToS and can attract liability under personality-rights and impersonation provisions in several jurisdictions, including under the IT Act 2000 and BNS 2023 frameworks applicable in India. The safe default: clone only your own voice, or a voice for which you have written consent on file.

For self-hosted or open-source paths where data residency or per-character cost is the dominant constraint, alternatives like Coqui TTS, Bark, and Piper are worth a look. The trade-off is voice naturalness and language coverage; Eleven v3 is this article’s pick for natural-sounding speech across 70+ languages, which is why Step 6 uses it for the multilingual example.

The complete code from steps 1 to 5 is around 30 lines once consolidated into a single file. The free-tier 10,000-credit cap¹ is enough to validate your use case before committing to a paid plan, and the Python SDK³ is stable enough that the same code will work in production with only credentials and error-handling changes.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. ElevenLabs pricing page, free tier 10,000 credits per month; 1 credit per character on Multilingual v2 or Eleven v3, 0.5 credits per character on Turbo or Flash (accessed 2026-05-05)
2. ElevenLabs developer documentation home, model lineup, output formats, supported languages list (accessed 2026-05-05)
3. elevenlabs/elevenlabs-python, official Python SDK README and code samples (accessed 2026-05-05)
4. ElevenLabs API reference, text-to-speech streaming endpoint (accessed 2026-05-05)
5. ElevenLabs blog, Eleven v3 general availability announcement, February 2026, 70+ language coverage (accessed 2026-05-18)
6. AI4Bharat IndicTTS, open-source Indian-language TTS with broader Indic-language coverage than ElevenLabs (accessed 2026-05-05)
7. ElevenLabs Terms of Service, voice-cloning consent requirement (accessed 2026-05-05)