J. Pedro: Meet ElevenLabs' flagship audio models — find the right fit for your use case

Hi Joaquim Pedro de,

Our flagship models cover the full spectrum of your audio needs, from ultra-realistic speech to real-time synthesis and multilingual use cases. Read on to find the right fit for your product.

Eleven v3 (alpha)

Our most emotionally rich, expressive speech model for dramatic delivery and performance

Eleven v3 (alpha) is our most expressive Text to Speech model. If you're working on videos, audiobooks, or media tools, it unlocks a new level of expressiveness.

Generate speech in 70+ languages
Multi-speaker dialogue
Audio tags like [excited], [whispers], and [sighs]

Audio tags sit inline with your script and are written in lowercase inside square brackets. For more on audio tags, see our prompting guide for v3 in the docs. Once generated, your audio can be downloaded as an MP3 file for immediate use.
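As a rough illustration of the tag format described above, the following Python sketch pulls audio tags out of a script with a regular expression. It does not call any ElevenLabs API. The helper names `find_audio_tags` and `strip_audio_tags` and the sample script are assumptions made for this example, and the pattern only reflects the lowercase, square-bracket format mentioned here, not the full rules in the prompting guide.

```python
import re

# Assumption: a tag is lowercase letters (optionally with spaces) inside
# square brackets, matching examples such as [excited], [whispers], [sighs].
TAG_PATTERN = re.compile(r"\[([a-z][a-z ]*)\]")


def find_audio_tags(script: str) -> list[str]:
    """Return the audio tags found in a script, in order of appearance."""
    return TAG_PATTERN.findall(script)


def strip_audio_tags(script: str) -> str:
    """Return the script with tags removed and whitespace collapsed."""
    without_tags = TAG_PATTERN.sub("", script)
    return " ".join(without_tags.split())


script = "[excited] We did it! [whispers] Don't tell anyone yet. [sighs] Finally."
print(find_audio_tags(script))
print(strip_audio_tags(script))
```

A check like this can be handy for previewing the spoken text of a script or for spotting tags that were accidentally capitalized before sending the script for generation.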

Eleven Multilingual v2

Lifelike, consistent-quality speech model with natural-sounding output

Eleven Multilingual v2 produces natural, consistent, lifelike speech with high emotional range and contextual understanding across 29 languages while maintaining the speaker's unique characteristics and accent. Ideal for:

Audiobook production: Perfect for long-form narration with complex emotional delivery
Character voiceovers: Ideal for gaming and animation thanks to expressive range
Professional content: Well-suited for corporate videos and e-learning materials
Multilingual projects: Maintains consistent voice identity across languages

Eleven Flash v2.5

Ultra-low latency, affordable speech synthesis model

Our fastest speech synthesis model, built for real-time and high-volume use cases. Flash v2.5 delivers high-quality speech at ultra-low latency (~75ms) across 32 languages. It's cost-effective, scalable, and optimized for performance. Ideal for:

Conversational AI: Real-time voice agents and chatbots
Interactive apps: Games and applications requiring immediate response
Large-scale processing: Efficient for bulk text-to-speech conversion

Eleven Scribe v1

State-of-the-art speech recognition model

Our speech recognition model for transcription and speech analysis. Scribe v1 supports 99 languages and includes word-level timestamping, speaker diarization for multi-speaker recordings, and dynamic audio tagging for enhanced context. It's built to handle complex, multilingual audio at scale. Ideal for:

Transcription services: Accurate conversion of audio/video to text
Meeting documentation: Capture and document conversations with speaker tracking
Content analysis: Process large volumes of spoken content
Multilingual transcription: Accurate recognition across 99 languages
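
To make word-level timestamps and speaker diarization more concrete, here is a small Python sketch that merges consecutive words from the same speaker into turns. The `Word` and `Turn` structures, their field names, and the sample data are hypothetical and chosen for illustration only. They are not the actual Scribe v1 response format, which is described in the API docs.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical shape for one transcribed word; not the real Scribe schema.
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    speaker: str


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words: list[Word]) -> list[Turn]:
    """Merge consecutive words spoken by the same speaker into one turn."""
    turns: list[Turn] = []
    for word in words:
        if turns and turns[-1].speaker == word.speaker:
            # Same speaker is still talking: extend the current turn.
            last = turns[-1]
            last.end = word.end
            last.text = f"{last.text} {word.text}"
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


words = [
    Word("Hello", 0.0, 0.4, "speaker_0"),
    Word("there.", 0.4, 0.8, "speaker_0"),
    Word("Hi!", 1.1, 1.4, "speaker_1"),
]
for turn in group_into_turns(words):
    print(f"{turn.speaker} [{turn.start:.1f}s-{turn.end:.1f}s]: {turn.text}")
```

Turn-level output of this kind is the sort of structure meeting notes or captions are usually built from.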

Eleven Music

Generate studio-quality tracks instantly

With Eleven Music, businesses, creators, artists, and every single one of our users can generate studio-grade music from natural language prompts with:

Complete control over genre, style, and structure
Vocals or just instrumental
Multilingual support, including English, Spanish, German, Japanese, and more
Editing of the sound and lyrics for individual sections or the whole song

Created in collaboration with labels, publishers, and artists, Eleven Music is cleared for nearly all commercial uses, from film and television to podcasts and social media videos, and from advertisements to gaming. For more information on supported usage across our different plans, head here.

Model selection guide

Select the right model for your use case

Need full emotional range and dramatic delivery? Use Eleven v3 (alpha)
Need high-quality, expressive audio? Use Eleven Multilingual v2
Need real-time performance? Use Eleven Flash v2.5
Need multilingual + low latency? Use Multilingual v2 or Flash v2.5, depending on whether quality or speed is more important to your use case.
Need accurate transcription? Use Scribe v1
Need music? Use Eleven Music
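
The selection guide above can be sketched as a simple lookup in Python. This is only an illustration of the decision logic: the `Need` enum, the `pick_model` function, and the `prefer_speed` flag are names invented for this example, and the function returns the display names used in this email rather than API model identifiers.

```python
from enum import Enum


class Need(Enum):
    # Hypothetical labels for the needs listed in the selection guide.
    DRAMATIC_DELIVERY = "full emotional range and dramatic delivery"
    EXPRESSIVE_QUALITY = "high-quality, expressive audio"
    REAL_TIME = "real-time performance"
    MULTILINGUAL_LOW_LATENCY = "multilingual + low latency"
    TRANSCRIPTION = "accurate transcription"
    MUSIC = "music"


def pick_model(need: Need, prefer_speed: bool = False) -> str:
    """Return the model name the guide suggests for a given need."""
    if need is Need.MULTILINGUAL_LOW_LATENCY:
        # The guide leaves this choice to whether quality or speed matters more.
        return "Eleven Flash v2.5" if prefer_speed else "Eleven Multilingual v2"
    suggestions = {
        Need.DRAMATIC_DELIVERY: "Eleven v3 (alpha)",
        Need.EXPRESSIVE_QUALITY: "Eleven Multilingual v2",
        Need.REAL_TIME: "Eleven Flash v2.5",
        Need.TRANSCRIPTION: "Scribe v1",
        Need.MUSIC: "Eleven Music",
    }
    return suggestions[need]


print(pick_model(Need.REAL_TIME))
print(pick_model(Need.MULTILINGUAL_LOW_LATENCY, prefer_speed=True))
```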