
January 29th, 2025

Building a Voice Bot with Rasa and Cartesia: A Technical Tutorial

Souvik Ghosh

Building Voice-Enabled Conversational AI with Rasa and Cartesia

Modern conversational AI demands more than just text interactions. With Rasa 3.11's native voice streaming capabilities, you can now create sophisticated voice-enabled bots that deliver natural, responsive conversations. In this hands-on tutorial, we'll explore how to leverage Rasa's integration with Deepgram for Speech-to-Text and Cartesia for Text-to-Speech to build a production-ready voice bot.

Why This Stack Matters

The combination of Rasa 3.11 with Cartesia is a significant advancement in voice-enabled conversational AI. While traditional voice bots often struggle with latency and natural speech patterns, Cartesia's architecture is specifically optimized for ultra-low latency voice synthesis. This means your bot can respond with near-human speed, making conversations feel natural and engaging.

What sets this integration apart is Cartesia's advanced neural TTS models, which produce voice output that closely mirrors human speech patterns, including proper intonation and emotional resonance. When combined with Rasa's native voice streaming capabilities, you get a powerful platform for building sophisticated voice interactions without the complexity of additional middleware.

The Power of Native Integration

Rasa 3.11's native support for voice streaming means you can:

  • Handle real-time voice input and output without additional middleware
  • Process streaming audio directly within your Rasa pipeline
  • Maintain conversation context seamlessly across voice interactions
  • Scale your voice bot efficiently using Rasa's CALM (Conversational AI with Language Models) framework

Through this tutorial, you'll learn how to leverage these capabilities to build a voice bot that can handle complex conversations while maintaining natural, fluid interactions. Whether you're building a customer service bot, a virtual assistant, or any other conversational application, this integration provides the foundation for sophisticated voice-enabled experiences.

Let's dive into the architecture and technical implementation details to bring your voice bot to life.

Architecture

The architecture diagram illustrates how Rasa's browser_audio connector streamlines voice handling.

The system leverages the CALM framework to manage dialogue while integrating directly with speech services. This architecture enables real-time audio streaming from the browser, direct voice processing through the speech service pipeline, seamless context maintenance across voice interactions, and efficient scaling through Rasa's enterprise-grade infrastructure.

Setting Up Your Development Environment

Before diving into the voice integration, let's set up a robust development environment. This tutorial uses Python 3.10.10 and Rasa Pro 3.11.3. For other installation options, refer to the Rasa Pro installation documentation.

First, verify your Python environment:

Shell
python --version  # Ensure you're running Python 3.10.10
pip --version

We'll use uv as our package manager for faster dependency resolution:

Shell
uv venv
source .venv/bin/activate
uv pip install rasa-pro==3.11.3

Don't forget to obtain your Rasa Pro Developer Edition License from the license request page. Export it as an environment variable:

Shell
export RASA_PRO_LICENSE=YOUR_LICENSE_KEY

Check the Rasa version to confirm the installation:

Shell
rasa --version
Rasa Pro Version : 3.11.3
Minimum Compatible Version: 3.11.0rc1
Rasa SDK Version : 3.11.0
Python Version : 3.10.10
Operating System : macOS-15.2-arm64-arm-64bit
Python Path : /Users/souvikghosh/Documents/souvik-dev/.venv/bin/python3
License Expires : 2025-11-18T00:00:00+00:00

You're all set! Let's get started.

Setup Rasa Bot

Download the basic CALM bot by running the following command:

Shell
rasa init --template tutorial

For this tutorial, we'll keep the demo content of the bot as provided by the template; our focus is on making sure we can talk to the bot over voice.

Configuring Speech Services

The tutorial integrates two crucial speech services:

  1. Deepgram for Automatic Speech Recognition (ASR)
  2. Cartesia for Text-to-Speech (TTS)

After creating accounts with both services, set up your environment:

Shell
export DEEPGRAM_API_KEY=YOUR-DEEPGRAM-KEY
export CARTESIA_API_KEY=YOUR-CARTESIA-KEY
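Before moving on, it's worth confirming that both keys (plus the Rasa Pro license from earlier) are actually exported in your current shell. The check below is a minimal, generic sketch rather than anything Rasa-specific, but it can save a confusing startup failure later:

Shell
# Sanity check: report any credentials missing from the environment
missing=0
for var in RASA_PRO_LICENSE DEEPGRAM_API_KEY CARTESIA_API_KEY; do
  if [ -z "$(eval echo "\$$var")" ]; then
    echo "$var is not set"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "All credentials present"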

For additional ASR and TTS service options, consult the Rasa speech integrations documentation.

Integrating Voice Capabilities

The integration happens through the credentials.yml file. Here's the minimal configuration needed:

YAML
browser_audio:
  server_url: 0.0.0.0
  asr:
    name: deepgram
  tts:
    name: cartesia

This configuration tells the browser_audio channel to listen for streaming voice input, route speech recognition through Deepgram, and generate voice responses using Cartesia. Both services support extensive configuration options for voice selection, language preferences, and other parameters. You can explore these in the speech integration configuration documentation.
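As an illustration, you might pin the recognition language or pick a specific Cartesia voice. The extra option names below (language, voice) are assumptions for the sake of example; confirm the exact keys supported by your Rasa version against the speech integration configuration documentation:

YAML
browser_audio:
  server_url: 0.0.0.0
  asr:
    name: deepgram
    language: en                    # illustrative -- confirm the exact key in the docs
  tts:
    name: cartesia
    voice: YOUR-CARTESIA-VOICE-ID  # illustrative -- confirm the exact key in the docs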

Running Your Voice-Enabled Bot

With the setup complete, you can now bring your bot to life:

Shell
rasa train
rasa inspect --voice

The inspector provides a development interface where you can test your voice interactions in real-time.

Next Steps

This tutorial walked you through creating a foundation for voice-enabled conversational AI. From here, you can experiment with different voice configurations, implement more complex dialogue patterns, fine-tune the latency and response characteristics, and add advanced error handling for voice interactions.

The combination of Rasa, Deepgram, and Cartesia provides a powerful platform for building sophisticated voice experiences that feel natural and responsive. Whether you're building a customer service bot or a virtual assistant, you now have the foundation for creating voice experiences that your users will want to interact with.

Remember to check the official documentation for each service for the latest updates and advanced configuration options. The integration possibilities between these platforms offer exciting opportunities for creating truly engaging conversational experiences.