Sesame Unveils Expressive AI Voice Model for Natural Conversations

A futuristic AI assistant engaging in a natural conversation with a human in a modern workspace. The AI appears as a sleek, glowing digital interface, responding with expressive speech. On the holographic display, waveforms and text visualize the AI’s real-time speech generation, emphasizing its advanced conversational abilities and near-human expressiveness. The setting reflects a high-tech environment, symbolizing the future of AI-powered voice interactions.

Image Source: ChatGPT-4o

Sesame has introduced its Conversational Speech Model (CSM), a breakthrough in AI-generated speech designed to create more lifelike and responsive interactions. Unlike traditional text-to-speech (TTS) systems, CSM adapts to conversational context in real time, making AI voices more natural and expressive.

The company envisions this technology powering all-day wearable voice companions, offering a seamless, human-like experience in digital interactions.

How Sesame’s CSM Stands Out

Many AI voice assistants today sound mechanical and unnatural, struggling with conversational flow and emotional nuance. Sesame’s CSM sets itself apart by focusing on:

  • Emotional intelligence: Reads and responds to emotional cues, adjusting tone and expression naturally.

  • Conversational dynamics: Handles interruptions, pauses, and emphasis like a real speaker.

  • Contextual awareness: Adapts speech style to match different situations.

  • Consistent personality: Maintains a coherent voice and identity over time.

These features enable more natural and engaging AI interactions, making CSM ideal for voice assistants, customer support, and interactive AI companions.

Advancing AI Speech with Context Awareness

Most AI speech models struggle with prosody and context, often generating robotic or unnatural speech. While recent models produce high-quality audio, they fail to account for tone, rhythm, and conversational history, leading to awkward or inappropriate responses.

CSM solves this by using a multimodal learning approach, integrating both text and speech history to generate coherent and context-aware responses.
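
For a concrete, if simplified, picture of what combining text and speech history can look like, here is a minimal Python sketch. The `Turn` structure, the separator token, and the `build_context` function are illustrative assumptions, not Sesame’s actual implementation.

```python
# Minimal sketch (not Sesame's code): interleaving text and audio tokens
# from prior turns into a single context sequence, as a multimodal speech
# model might consume. Token values here are placeholders.

from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker: str
    text_tokens: List[int]    # tokenized transcript of the turn
    audio_tokens: List[int]   # discrete audio codes for the turn (e.g., RVQ codes)

def build_context(history: List[Turn], new_text_tokens: List[int]) -> List[int]:
    """Flatten the conversation history into one interleaved token stream,
    then append the text that should be spoken next."""
    SEP = -1  # hypothetical separator token
    context: List[int] = []
    for turn in history:
        context.extend(turn.text_tokens)
        context.append(SEP)
        context.extend(turn.audio_tokens)
        context.append(SEP)
    context.extend(new_text_tokens)
    return context

history = [Turn("user", [101, 102], [900, 901, 902]),
           Turn("assistant", [201, 202, 203], [910, 911])]
print(build_context(history, [301, 302]))
```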

Key Innovations in CSM:

  • Single-stage processing: Unlike traditional multi-step TTS systems, CSM generates speech in one unified process, improving both efficiency and expressivity.

  • Advanced evaluation suite: Since common public benchmarks are saturated, Sesame developed a new evaluation suite to better measure contextual understanding and prosody.

  • Real-time adaptability: The model adjusts tone and pacing dynamically, making interactions feel genuinely conversational.

  • Low latency: CSM operates with minimal delay, enabling fluid back-and-forth exchanges where users can interrupt and receive natural responses instantly.
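
As a rough illustration of the low-latency point, the sketch below streams audio chunk by chunk and stops as soon as the user interrupts; the chunk generator and the interruption check are placeholders, not Sesame’s API.

```python
# Illustrative sketch only: audio is emitted chunk by chunk as it is
# generated, and the loop aborts when the user barges in.

import time
from typing import Iterator

def fake_speech_chunks(text: str) -> Iterator[bytes]:
    """Stand-in for an incremental speech generator."""
    for word in text.split():
        time.sleep(0.01)          # pretend per-chunk decoding latency
        yield word.encode()       # pretend this is a short audio buffer

def user_interrupted() -> bool:
    return False                  # replace with real barge-in detection

def speak(text: str) -> None:
    for chunk in fake_speech_chunks(text):
        if user_interrupted():    # stop speaking immediately on interruption
            break
        # play(chunk) would go here; we just report the chunk size
        print(f"played {len(chunk)}-byte chunk")

speak("Low latency comes from emitting audio before the full utterance is decoded")
```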

How CSM Works: A Technical Overview

CSM builds on recent advancements in transformer-based speech modeling, particularly Residual Vector Quantization (RVQ) tokenization. This approach encodes both semantic tokens (capturing meaning) and acoustic tokens (capturing fine-grained details like speaker identity and intonation).
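
The RVQ idea itself is simple enough to sketch: each codebook quantizes whatever residual the previous codebook left behind, so an audio frame becomes one small code index per codebook. The toy version below uses random codebooks purely for illustration; real systems learn them from data.

```python
# A toy residual vector quantizer (RVQ): each stage quantizes the residual
# left by the previous stage, so a frame is encoded as one index per codebook.

import numpy as np

rng = np.random.default_rng(0)
num_codebooks, codebook_size, dim = 4, 16, 8
codebooks = rng.normal(size=(num_codebooks, codebook_size, dim))  # random, for illustration

def rvq_encode(frame: np.ndarray) -> list:
    residual = frame.copy()
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
        codes.append(idx)
        residual = residual - cb[idx]   # next stage models what is left over
    return codes

def rvq_decode(codes: list) -> np.ndarray:
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

frame = rng.normal(size=dim)
codes = rvq_encode(frame)
approx = rvq_decode(codes)
print(codes, float(np.linalg.norm(frame - approx)))  # reconstruction error shrinks with more codebooks
```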

CSM’s Unique Structure - Two-Stage Transformer System:

  • Backbone Transformer: Processes interleaved text and audio to understand context.

  • Audio Decoder: Reconstructs high-fidelity speech from encoded tokens.
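
That description can be turned into a schematic, deliberately tiny PyTorch model: a backbone transformer over interleaved text and audio embeddings, a lighter audio decoder on top, and one prediction head per RVQ codebook. The layer counts, dimensions, and class names below are arbitrary assumptions, not Sesame’s architecture.

```python
# Schematic only, based on the article's description of the two-stage design.

import torch
import torch.nn as nn

class ToyCSM(nn.Module):
    def __init__(self, vocab=1024, d_model=256, n_codebooks=4, codebook_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)  # shared embedding for interleaved text/audio tokens
        backbone_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(backbone_layer, num_layers=4)      # context model
        decoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.audio_decoder = nn.TransformerEncoder(decoder_layer, num_layers=2)  # lighter acoustic head
        self.code_heads = nn.ModuleList(
            [nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)]
        )  # one prediction head per RVQ codebook

    def forward(self, interleaved_tokens):
        h = self.backbone(self.embed(interleaved_tokens))   # contextual hidden states
        h = self.audio_decoder(h)                           # refine toward acoustic detail
        return [head(h) for head in self.code_heads]        # logits per codebook

logits = ToyCSM()(torch.randint(0, 1024, (1, 32)))
print([tuple(l.shape) for l in logits])
```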

Compute Amortization:

  • Reduces memory overhead by training on only 1/16th of audio frames, improving efficiency without losing quality.
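
As a rough sketch of that idea (with made-up numbers), one can sample a small fraction of frame positions each training step and compute the decoder’s loss only at those positions:

```python
# Illustrative only: pick a random 1/16th of frame positions and average the
# (placeholder) per-frame decoder losses over that subset.

import numpy as np

rng = np.random.default_rng(0)

def amortized_frame_indices(num_frames: int, fraction: float = 1 / 16) -> np.ndarray:
    k = max(1, int(num_frames * fraction))
    return rng.choice(num_frames, size=k, replace=False)

def decoder_loss_on_subset(per_frame_losses: np.ndarray) -> float:
    idx = amortized_frame_indices(len(per_frame_losses))
    return float(per_frame_losses[idx].mean())   # gradients would flow only through these frames

losses = rng.random(640)                          # pretend per-frame decoder losses
print(len(amortized_frame_indices(640)), decoder_loss_on_subset(losses))
```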

Scalability & Model Sizes - Three versions:

  • Tiny: 1B backbone, 100M decoder

  • Small: 3B backbone, 250M decoder

  • Medium: 8B backbone, 300M decoder

Benchmarking CSM: Measuring Human-Like Speech at Scale

To accurately assess CSM’s conversational capabilities, Sesame developed a new evaluation suite, addressing the limitations of existing benchmarks, which have become saturated with near-human performance scores. The suite focuses on four critical areas:

  • Faithfulness to text: Ensures the AI-generated speech accurately reflects the written input, measured using Word Error Rate (WER); a toy WER computation appears after this list.

  • Contextual understanding: Evaluates how well the model interprets ambiguous words based on context. This includes the Homograph Disambiguation test, which checks if the model correctly pronounces words like “lead” (/lɛd/ for metal vs. /liːd/ for guiding).

  • Prosody & expressivity: Assesses naturalness in rhythm, pitch, and tone. The Pronunciation Continuation Consistency test examines whether the AI maintains a consistent pronunciation for words with multiple variations (e.g., “route” as /raʊt/ vs. /ruːt/).

  • Latency & responsiveness: Measures real-time performance, ensuring low delay in conversational speech generation.
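
To make the faithfulness metric concrete, here is a toy WER computation: the word-level edit distance between the intended text and a transcript of the generated speech, divided by the number of reference words. The example sentences are invented.

```python
# Toy word error rate (WER) computation using word-level edit distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(1, len(ref))

print(wer("take the second left onto route nine",
          "take a second left onto route nine"))  # one substitution -> ~0.14
```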

Key Results & Findings

  • Near-human text accuracy: CSM achieved a Word Error Rate (WER) comparable to human speech, meaning the words it spoke were highly faithful to the intended text.

    CSM’s accuracy was measured using Word Error Rate (WER) and Speaker Similarity benchmarks. WER evaluates how faithfully AI speech reproduces the written text, while Speaker Similarity measures how consistently the model maintains vocal characteristics. CSM’s Small model achieved a WER of 2.9%, matching human performance, while its speaker similarity score of 0.938 closely aligns with real human recordings (0.940), demonstrating highly consistent speech generation. (A simplified version of this similarity check is sketched after this list.)

  • Pronunciation consistency: To measure pronunciation accuracy and consistency, Sesame tested CSM against models from Play.ht, ElevenLabs, and OpenAI on two key benchmarks: Homograph Disambiguation (correctly pronouncing ambiguous words like "lead" and "bass") and Pronunciation Continuation Consistency (maintaining a chosen pronunciation across multi-turn conversations). CSM’s Medium model scored significantly higher than the competing models in both categories.

  • Human-level expressivity: In a Comparative Mean Opinion Score (CMOS) study, evaluators listened to AI-generated and real human speech samples and selected which sounded more natural:

  • Without context, evaluators chose CSM’s speech 52.9% of the time, making it essentially indistinguishable from real human voices.

  • With conversational context provided, human recordings were still preferred 66.7% of the time, indicating that while AI-generated prosody is improving, subtle gaps remain in conversational expressiveness.

  • Reduced delay for real-time speech: Unlike traditional AI models that introduce lag due to multi-step processing, CSM generates speech with minimal latency, allowing for smooth back-and-forth conversation where users can interrupt and receive immediate responses.
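
As a companion to the speaker-similarity figures above, here is a simplified version of such a check: embed two utterances with a speaker encoder and take the cosine similarity of the embeddings. The random vectors below stand in for real speaker embeddings, which would come from a pretrained speaker-encoder model.

```python
# Sketch of a speaker-similarity check: cosine similarity between two
# speaker embeddings. The embeddings here are random placeholders.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
reference_embedding = rng.normal(size=192)                               # utterance from the real speaker
generated_embedding = reference_embedding + 0.1 * rng.normal(size=192)   # model output in the "same" voice

print(round(cosine_similarity(reference_embedding, generated_embedding), 3))
```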

Open-Sourcing Their Work

Sesame believes that advancing conversational AI should be a collaborative effort. To encourage innovation, the company is open-sourcing key components of its research, allowing developers and researchers to experiment, enhance, and expand its capabilities.

The models will be released under an Apache 2.0 license, enabling broad adoption and modification within the AI community.

You can view their GitHub for updates & contributions: Sesame AI Labs GitHub.

Limitations and Future Work

While CSM represents a significant advancement in conversational AI, it still has areas for improvement:

  • Limited multilingual support: The model is primarily trained on English data. Some multilingual ability emerges, but performance in other languages is not yet reliable.

  • No integration with pre-trained language models: CSM does not currently use knowledge from pre-trained models, which could enhance understanding and fluency.

  • Conversational structure challenges: While CSM captures speech content and prosody well, it does not yet model the broader structure of human conversations, such as turn-taking and natural pauses.

Future Plans

In the coming months, Sesame plans to:

  • Expand language support to over 20 languages by scaling model size and increasing dataset volume.

  • Integrate pre-trained language models to enhance conversational depth and context understanding.

  • Develop fully duplex models capable of handling interruptions, dynamic pacing, and real-time conversational flow, making AI speech even more natural.

Sesame envisions a future where AI-powered voice assistants interact seamlessly in human-like conversations, a goal that will require fundamental innovations in data curation, post-training methodologies, and multimodal learning.

What This Means: A Step Toward Human-Like AI Assistants

Sesame’s Conversational Speech Model (CSM) represents a major step toward AI-powered voice assistants that feel truly natural and interactive. With its real-time adaptability, contextual awareness, and near-human expressivity, CSM brings AI speech closer than ever to mimicking human conversations.

However, there’s still work to be done. While CSM excels in generating expressive speech, future advancements—such as multilingual expansion, deeper conversational structuring, and fully duplex interaction—will be key to bridging the remaining gap between AI and human communication.

By refining latency, conversational flow, and contextual intelligence, AI-powered assistants could become seamless companions in everyday life, enhancing everything from wearable voice interfaces to interactive customer service and AI-driven storytelling.

With rapid improvements in speech modeling and multimodal AI, the next wave of conversational AI will likely move beyond static responses to fluid, intuitive, and deeply engaging interactions—paving the way for a future where AI speaks, listens, and understands just like us.

For more technical details and to hear voice samples, please visit Sesame’s research blog.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.