AiNews.com
Hume’s Octave TTS Understands Context, Outperforms ElevenLabs

Alicia Shapiro
March 03, 2025 • Estimated Reading Time: 7 minutes

A futuristic AI-driven text-to-speech interface showcasing Hume’s Octave TTS system. The image features a sleek digital dashboard displaying a waveform visualizer, a script input box, and an AI-generated voice preview. A glowing neural network in the background represents Octave’s ability to interpret speech context, emotions, and expressions. The setting conveys innovation in synthetic voice technology and the evolution of expressive AI-generated speech.

Image Source: ChatGPT-4o

Hume has introduced Octave (Omni-capable text and voice engine), which it describes as the first large language model (LLM) built for text-to-speech (TTS). Unlike conventional TTS systems that merely read text aloud, Octave is a speech-language model that understands context, emotion, and nuance, allowing it to generate lifelike voices with expressive intonation.

Octave can:

Act out characters with distinct vocal personalities.
Generate voices from simple text prompts.
Modify speech style and emotion based on user instructions.

In blind comparison testing, 180 human raters preferred Octave’s voice generations over ElevenLabs’ in:

Audio quality – 71.6% of trials.
Naturalness – 51.7% of trials.
Matching voice descriptions – 57.7% of trials.

These results suggest Octave holds a clear lead in audio quality and a smaller one in matching voice descriptions, while naturalness was close to an even split.

How Octave Generates More Natural Speech

Trained as a large language model for speech, Octave doesn’t just read words—it interprets meaning, rhythm, tone, and structure to create more natural-sounding voices.

Context-Aware Speech Adjustments:

Interprets sarcasm and adjusts tone accordingly.
Recognizes emotions like revulsion, fear, or excitement and modulates speech appropriately.
Emphasizes key words and phrases based on meaning, much like a human actor.

For example, given a sarcastic script, Octave automatically adopts a sarcastic tone; given a panicked phrase, it injects urgency into the voice.

Voice Design: Creating AI Voices from Prompts

With Octave’s Voice Design feature, users can create unique AI voices using only a text description. Whether it’s a calm therapist, a dramatic medieval knight, or a movie trailer narrator, Octave tailors the voice to match.

Users can input a description like "gruff goblin auctioneer with a Cockney accent", and Octave will generate a fitting voice.
Voices can be customized further with accents, speaking styles, and personality traits.
Users can also skip descriptions entirely, allowing Octave to generate voices based purely on the script in the Playground.

See Hume's documentation for Octave's prompting guide.
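As a rough illustration of how a script and a voice description might travel together in one request, here is a minimal Python sketch. It only builds a request body and sends nothing. The field names ("utterances", "text", "description") and the sample script line are assumptions made for illustration, not a confirmed schema, so check Hume's API reference before relying on them.

```python
import json


def build_tts_request(text, description=None):
    """Build a JSON-serializable request body for a single utterance.

    The keys used here are assumptions for illustration only;
    consult Hume's API reference for the actual schema.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    utterance = {"text": text}
    # The description is optional: the article notes that users can skip
    # it and let the voice be inferred from the script alone.
    if description:
        utterance["description"] = description
    return {"utterances": [utterance]}


body = build_tts_request(
    "Sold, to the troll in the back!",  # hypothetical script line
    description="gruff goblin auctioneer with a Cockney accent",
)
print(json.dumps(body, indent=2))
```

Keeping the description optional mirrors the two workflows described above: a prompted voice, or a voice chosen purely from the script.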

Acting Instructions: Adopting Any Speaking Style

Octave can modify speech styles dynamically, just like a voice actor taking direction. Users can specify emotions and tones:

Whispering, hushed – For secrecy.
Calm, serene – For a peaceful delivery.
Disgusted, disdainful – To add attitude.
Angry, furious – To inject intensity.
Pained, shocked – To express distress.
Shouting, triumphant – For excitement and victory.

This feature, available through Hume's Acting Instructions tool, enables realistic and emotionally responsive AI-generated speech, a breakthrough in synthetic voice technology.
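One way to picture acting instructions is as a stage direction attached to each line of a script. The Python sketch below models that idea with a small data class. The class, the function, the voice name, the sample lines, and the dictionary keys are all hypothetical and are not taken from Hume's SDK.

```python
from dataclasses import dataclass


@dataclass
class DirectedLine:
    text: str       # what the voice should say
    direction: str  # how it should be said, e.g. "whispering, hushed"


def to_utterances(lines, voice_name):
    """Pair each line with its direction for one named voice.

    The keys ("text", "description", "voice") are hypothetical and only
    illustrate the shape such a request could take.
    """
    return [
        {
            "text": line.text,
            "description": line.direction,
            "voice": {"name": voice_name},
        }
        for line in lines
    ]


# Hypothetical two-line script using styles listed in the article.
script = [
    DirectedLine("Keep your voice down, they're right outside.", "whispering, hushed"),
    DirectedLine("We did it! We actually did it!", "shouting, triumphant"),
]
utterances = to_utterances(script, voice_name="narrator")
for item in utterances:
    print(item["description"], "->", item["text"])
```

The point of the structure is that the voice stays fixed while the direction changes line by line, much like a voice actor taking notes between takes.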

To hear examples of Octave's voice generation, visit Hume's blog post.

Benchmarking: Octave vs. ElevenLabs

To evaluate Octave’s performance, Hume conducted an internal benchmarking study against ElevenLabs’ TTS system. The study used 120 voice descriptions chosen to reflect the diverse voice characteristics users ask for, and Hume used Gemini to create plausible dialogue for each description. For every voice prompt and text input, Hume generated three samples with Octave and three with ElevenLabs Voice Design. Human raters then blindly compared the expressiveness, accuracy, and overall quality of the generated speech.
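To make the study design concrete, the following Python sketch lays out blind pairwise trials for 120 descriptions with three samples per system. The article does not say how samples were paired or how many trials were run, so the one-to-one pairing, the A/B labeling, and every name in the code are assumptions for illustration.

```python
import random

SYSTEMS = ("octave", "elevenlabs")


def build_blind_trials(num_descriptions=120, samples_per_system=3, seed=0):
    """Create blind A/B trials under an assumed one-to-one sample pairing."""
    rng = random.Random(seed)
    trials = []
    for description_id in range(num_descriptions):
        for sample_id in range(samples_per_system):
            # Randomize which system appears as "A" so that position
            # cannot reveal the system to the rater.
            order = list(SYSTEMS)
            rng.shuffle(order)
            trials.append(
                {
                    "description_id": description_id,
                    "sample_id": sample_id,
                    "A": order[0],
                    "B": order[1],
                }
            )
    # Shuffle presentation order so trials for one description are spread out.
    rng.shuffle(trials)
    return trials


trials = build_blind_trials()
print(len(trials))  # 120 descriptions x 3 paired samples = 360
```

Under this assumed pairing there are 360 distinct comparisons. The real study may have paired samples differently or shown each comparison to several raters.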

A bar chart comparing human preference ratings for Hume’s Octave and ElevenLabs' TTS models across three metrics: Naturalness, Description Match, and Audio Quality. Octave, represented in purple, outperforms ElevenLabs, shown in beige, in all categories, with the most significant advantage in audio quality.

Hume Octave vs. ElevenLabs – Preference by Metric. Image Source: Hume

Results:

Octave’s voices were preferred in 71.6% of trials for audio quality.
Octave was favored for naturalness in 51.7% of trials.
Octave’s speech better matched the intended style and prompt in 57.7% of cases.

These findings indicate that Octave not only produces more natural speech but also follows detailed user prompts more accurately.
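How much weight a preference rate deserves depends on how many trials sit behind it, and the article does not report that number. The Python sketch below computes a normal-approximation 95% interval for each reported rate under a hypothetical trial count. `ASSUMED_TRIALS` is an illustrative placeholder, not a figure from Hume's study.

```python
import math


def preference_interval(rate, num_trials, z=1.96):
    """Normal-approximation confidence interval for a preference rate."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if num_trials <= 0:
        raise ValueError("num_trials must be positive")
    standard_error = math.sqrt(rate * (1.0 - rate) / num_trials)
    return rate - z * standard_error, rate + z * standard_error


ASSUMED_TRIALS = 1000  # hypothetical; the article does not state the trial count

# Preference rates as reported in the article.
reported = {
    "audio quality": 0.716,
    "naturalness": 0.517,
    "description match": 0.577,
}

for metric, rate in reported.items():
    low, high = preference_interval(rate, ASSUMED_TRIALS)
    verdict = "above 50%" if low > 0.5 else "not clearly above 50%"
    print(f"{metric}: {low:.3f} to {high:.3f} ({verdict})")
```

With the assumed count, the audio quality and description match intervals sit above 50%, while the naturalness interval straddles it. A larger real trial count would narrow all three intervals.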

A triangular radar chart illustrating how Hume’s Octave and ElevenLabs’ TTS models compare in Naturalness, Description Match, and Audio Quality. Octave, shown in purple, leads across all metrics, especially in audio quality, where it scores the highest. ElevenLabs, represented in beige, trails behind in each category.

Hume Octave vs. ElevenLabs – Preference Radar Chart. Image Source: Hume

Expressive TTS Arena: A New Public Evaluation Platform

To advance TTS benchmarking, Hume is launching Expressive TTS Arena, a public testing platform where users can compare and rate AI-generated voices.

Unlike traditional TTS evaluations that focus on short, isolated phrases, this new platform will:

Test longer, more expressive speech samples.
Evaluate how well models respond to complex voice descriptions.
Measure steerability—how well the AI adapts to user instructions.

This initiative aims to set new industry standards for expressive and dynamic AI-generated speech.

Octave Creator Studio & Developer Tools

Octave is available today via:

Hume’s platform (platform.hume.ai) for creators and developers.
API integration through Python and TypeScript SDKs.
A Projects interface for generating long-form content like audiobooks and podcasts (currently in Preview).

Developers can access prebuilt voices, voice design tools, and customizable speech generation features, reducing time-to-market for voice-enabled applications.

What’s Next?

Hume plans to continue improving Octave, focusing on:

Expanding language support beyond English and Spanish.
Refining expressive speech capabilities for even more nuanced AI voices.
Launching Voice Cloning, which will let users generate AI voices from 5 seconds of sample audio and is expected to be available in the coming weeks.

While Octave is already setting a new standard for TTS, Hume envisions broader applications—training AI systems that understand human emotion and anticipate user needs.

What This Means

Octave represents a major leap forward in AI-generated speech, moving beyond basic text reading to context-aware, emotionally expressive voice synthesis.

For businesses & developers: More natural, customizable AI voices for virtual assistants, audiobooks, gaming, and content creation.
For consumers: AI-generated speech that sounds more human-like, expressive, and engaging.
For the industry: A shift toward LLM-powered speech models that understand and interpret language in real time.

With superior voice quality, better prompt adherence, and enhanced expressiveness, Octave challenges existing TTS systems—potentially redefining AI-generated voices as we know them.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.