Speech Integrations

Audio Format

Rasa uses a common intermediate audio format called RasaAudioBytes as the standard data format exchanged between voice channels, ASR engines, and TTS engines, which keeps the number of format conversions manageable. Currently, this corresponds to:

  • Raw wave format
  • 8kHz sample rate
  • 8-bit depth
  • Mono channel
  • μ-law (mulaw) encoding

These parameters are not configurable. Rasa uses the audioop-lts library to convert between audio encodings (with functions such as ulaw2lin() and lin2ulaw()).
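
For illustration, the sketch below shows how audio could be converted between RasaAudioBytes (μ-law) and 16-bit linear PCM with the audioop module provided by audioop-lts. The helper functions are hypothetical and only demonstrate the relevant calls; resampling to 8kHz mono, if needed, is a separate step (for example with audioop.ratecv()).

import audioop  # provided by the audioop-lts package

def ulaw_to_pcm16(ulaw_bytes: bytes) -> bytes:
    """Decode 8-bit μ-law audio into 16-bit linear PCM (sample width 2)."""
    return audioop.ulaw2lin(ulaw_bytes, 2)

def pcm16_to_ulaw(pcm_bytes: bytes) -> bytes:
    """Encode 16-bit linear PCM audio as 8-bit μ-law (sample width 2)."""
    return audioop.lin2ulaw(pcm_bytes, 2)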

Automatic Speech Recognition (ASR)

This section describes the supported integrations with Automatic Speech Recognition (ASR) or Speech To Text (STT) services.

Deepgram

Set the environment variable DEEPGRAM_API_KEY to your Deepgram API key. You can request a key from Deepgram. Deepgram ASR can be configured in a Voice Stream channel as follows:

credentials.yml
browser_audio:
  # ... other configuration
  asr:
    name: deepgram

Turn Detection

Deepgram uses two mechanisms to detect when a speaker has finished talking:

  1. Endpointing: Uses Voice Activity Detection (VAD) to detect silence after speech
  2. UtteranceEnd: Looks at word timings to detect gaps between words

The configuration parameters endpointing and utterance_end_ms below control these two features respectively. In noisy environments, utterance_end_ms may be more reliable because it ignores non-speech audio. Read more in the Deepgram documentation.

Configuration parameters

  • endpoint: Optional, defaults to api.deepgram.com - The endpoint URL for the Deepgram API.
  • endpointing: Optional, defaults to 400 - Number of milliseconds of silence to determine the end of speech.
  • language: Optional, defaults to en - The language code for the speech recognition.
  • model: Optional, defaults to nova-2-general - The model to be used for speech recognition.
  • smart_format: Optional, defaults to true - Boolean value to enable or disable Deepgram's smart formatting.
  • utterance_end_ms: Optional, defaults to 1000 - Time in milliseconds to wait before considering an utterance complete.
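
For reference, a configuration that spells out all of these parameters could look like the following; the values shown are simply the documented defaults.

credentials.yml
browser_audio:
  # ... other configuration
  asr:
    name: deepgram
    endpoint: api.deepgram.com
    endpointing: 400
    language: en
    model: nova-2-general
    smart_format: true
    utterance_end_ms: 1000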

Azure

Requires the Python library azure-cognitiveservices-speech. The API key can be set with the environment variable AZURE_SPEECH_API_KEY. A sample configuration looks as follows:

credentials.yml
browser_audio:
  # ... other configuration
  asr:
    name: azure

Configuration parameters

  • language: Required. The language code for the speech recognition. (See Azure documentation for a list of languages).
  • speech_region: Optional, defaults to None - The region identifier for the Azure Speech service, such as westus. Ensure that the region matches the region of your subscription.
  • speech_endpoint: Optional, defaults to None - The service endpoint to connect to. You can use it when you have Azure Speech service behind a reverse proxy.
  • speech_host: Optional, defaults to None - The service host to connect to. Standard resource path will be assumed. Format is "protocol://host:port" where ":port" is optional.

While speech_region, speech_endpoint, and speech_host are all optional, they cannot all be empty at the same time. If none of them is set, speech_region defaults to eastus.

When connecting to the Azure cloud, the speech_region parameter alone is enough. Here is an example configuration:

browser_audio:
  server_url: localhost
  asr:
    name: azure
    language: de-DE
    speech_region: germanywestcentral
  tts:
    name: azure
    language: de-DE
    voice: de-DE-KatjaNeural
    speech_region: germanywestcentral

Others

Looking for integration with a different ASR service? You can create your own custom ASR component.

Text To Speech (TTS)

This section describes the supported integrations with Text To Speech (TTS) services.

Azure TTS

The API key can be set with the environment variable AZURE_SPEECH_API_KEY. A sample configuration looks as follows:

credentials.yml
browser_audio:
  # ... other configuration
  tts:
    name: azure

Configuration parameters

  • language: Optional, defaults to en-US - The language code for the text-to-speech conversion. (See Azure documentation for a list of languages and voices).
  • voice: Optional, defaults to en-US-JennyNeural - The voice to be used for the text-to-speech conversion. Voice defines the specific characteristic of the voice, such as speaker's gender, age and speaking style.
  • timeout: Optional, defaults to 10 - The timeout duration in seconds for the text-to-speech request.
  • speech_region: Optional, defaults to None - The region identifier for the Azure Speech service. Ensure that the region matches the region of your subscription.
  • endpoint: Optional, defaults to None - The service endpoint for Azure Speech service.
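
As an illustration, an explicit Azure TTS configuration might look like the following; language, voice, and timeout are the documented defaults, and the speech_region value is only an example.

credentials.yml
browser_audio:
  # ... other configuration
  tts:
    name: azure
    language: en-US
    voice: en-US-JennyNeural
    timeout: 10
    speech_region: westus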

Cartesia TTS

Set the environment variable CARTESIA_API_KEY to your Cartesia API key. Obtaining a key requires a Cartesia account. Cartesia TTS can be configured in a Voice Stream channel as follows:

credentials.yml
browser_audio:
  # ... other configuration
  tts:
    name: cartesia

Configuration parameters

  • language: Optional, defaults to en - The language code for the text-to-speech conversion.
  • voice: Optional, defaults to 248be419-c632-4f23-adf1-5324ed7dbf1d - The id of the voice to use for text-to-speech conversion. The parameter will be passed to the Cartesia API as "voice": {"mode": "id", "id": "VALUE"}
  • timeout: Optional, defaults to 10 - The timeout duration in seconds for the text-to-speech request.
  • model_id: Optional, defaults to sonic-english - The model ID to be used for the text-to-speech conversion.
  • version: Optional, defaults to 2024-06-10 - The version of the model to be used for the text-to-speech conversion.
  • endpoint: Optional, defaults to https://api.cartesia.ai/tts/sse - The endpoint URL for the Cartesia API.
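
For reference, a Cartesia configuration with all parameters written out could look as follows; the values shown are the documented defaults.

credentials.yml
browser_audio:
  # ... other configuration
  tts:
    name: cartesia
    language: en
    voice: 248be419-c632-4f23-adf1-5324ed7dbf1d
    model_id: sonic-english
    version: "2024-06-10"
    timeout: 10
    endpoint: https://api.cartesia.ai/tts/sse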

Deepgram TTS

Set the environment variable DEEPGRAM_API_KEY to your Deepgram API key. You can request a key from Deepgram. Deepgram TTS can be configured in a Voice Stream channel as follows:

credentials.yml
browser_audio:
  # ... other configuration
  tts:
    name: deepgram

Configuration parameters

Deepgram does not use the parent-class parameters language or voice, because each model is uniquely identified by a name of the format [modelname]-[voicename]-[language].

  • model_id: Optional, defaults to aura-2-andromeda-en - The list of available options can be found in Deepgram Documentation.
  • endpoint: Optional, defaults to wss://api.deepgram.com/v1/speak - The endpoint URL for the Deepgram API.
  • timeout: Optional, defaults to 30 - The timeout duration in seconds for the text-to-speech request.
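
For reference, an explicit Deepgram TTS configuration could look like the following; the values shown are the documented defaults.

credentials.yml
browser_audio:
  # ... other configuration
  tts:
    name: deepgram
    model_id: aura-2-andromeda-en
    endpoint: wss://api.deepgram.com/v1/speak
    timeout: 30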

Others

Looking for integration with a different TTS service? You can create your own custom TTS component.

Custom ASR

You can implement your own custom ASR component as a Python class to integrate with any third-party speech recognition service. A custom ASR component must subclass the ASREngine class from rasa.core.channels.voice_stream.asr.asr_engine.

Your custom ASR component will receive audio in the RasaAudioBytes format and may need to convert it to your service's expected format.

Required Methods

Your custom ASR component must implement the following methods:

  • open_websocket_connection(): Establish a websocket connection to your ASR service
  • from_config_dict(config: Dict): Class method to create an instance from configuration dictionary
  • signal_audio_done(): Signal to the ASR service that audio input has ended
  • rasa_audio_bytes_to_engine_bytes(chunk: RasaAudioBytes): Convert Rasa audio format to your engine's expected format
  • engine_event_to_asr_event(event: Any): Convert your engine's events to Rasa's ASREvent format
  • get_default_config(): Static method that returns the default configuration for your component

Optional Methods

You may also override these methods as needed:

  • send_keep_alive(): Send keep-alive messages to maintain the connection. The default implementation is only a pass statement.
  • close_connection(): Custom cleanup when closing the connection. The default implementation is as follows:

async def close_connection(self) -> None:
    if self.asr_socket:
        await self.asr_socket.close()

ASR Events

Your engine_event_to_asr_event method should return appropriate ASREvent objects:

  • UserIsSpeaking(transcript): For interim/partial transcripts while the user is speaking
  • NewTranscript(transcript): For final transcripts when the user has finished speaking

See Configuration for details on how to configure your custom ASR component.

Example Implementation

Here's an example based on the Deepgram implementation structure:

custom_asr.py
import json
import os
from dataclasses import dataclass
from typing import Any, Dict, Optional
from urllib.parse import urlencode

import websockets
from websockets.legacy.client import WebSocketClientProtocol

from rasa.core.channels.voice_stream.asr.asr_engine import ASREngine, ASREngineConfig
from rasa.core.channels.voice_stream.asr.asr_event import (
    ASREvent,
    NewTranscript,
    UserIsSpeaking,
)
from rasa.core.channels.voice_stream.audio_bytes import RasaAudioBytes


@dataclass
class MyASRConfig(ASREngineConfig):
    api_key: str = ""
    endpoint: str = "wss://api.example.com/v1/speech"
    language: str = "en-US"


class MyASR(ASREngine[MyASRConfig]):
    required_env_vars = ("MY_ASR_API_KEY",)  # Optional: required environment variables
    required_packages = ("my_asr_package",)  # Optional: required Python packages

    def __init__(self, config: Optional[MyASRConfig] = None):
        super().__init__(config)
        self.accumulated_transcript = ""

    async def open_websocket_connection(self) -> WebSocketClientProtocol:
        """Connect to the ASR system."""
        api_key = os.environ["MY_ASR_API_KEY"]
        headers = {"Authorization": f"Bearer {api_key}"}
        return await websockets.connect(
            self._get_api_url_with_params(),
            extra_headers=headers,
        )

    def _get_api_url_with_params(self) -> str:
        """Build the API URL with query parameters."""
        query_params = {
            "language": self.config.language,
            "encoding": "mulaw",
            "sample_rate": "8000",
            "interim_results": "true",
        }
        return f"{self.config.endpoint}?{urlencode(query_params)}"

    @classmethod
    def from_config_dict(cls, config: Dict) -> "MyASR":
        """Create an instance from a configuration dictionary."""
        asr_config = MyASRConfig.from_dict(config)
        return cls(asr_config)

    async def signal_audio_done(self) -> None:
        """Signal to the ASR service that audio input has ended."""
        await self.asr_socket.send(json.dumps({"type": "stop_audio"}))

    def rasa_audio_bytes_to_engine_bytes(self, chunk: RasaAudioBytes) -> bytes:
        """Convert Rasa audio format to the engine format."""
        # For most services you can return the chunk directly,
        # since it is already mulaw-encoded
        return chunk

    def engine_event_to_asr_event(self, event: Any) -> Optional[ASREvent]:
        """Convert an engine response to an ASREvent."""
        data = json.loads(event)

        if data.get("type") == "transcript":
            transcript = data.get("text", "")

            if data.get("is_final"):
                # Final transcript - the user finished speaking
                full_transcript = self.accumulated_transcript + " " + transcript
                self.accumulated_transcript = ""
                return NewTranscript(full_transcript.strip())
            elif transcript:
                # Interim transcript - the user is still speaking
                return UserIsSpeaking(transcript)

        return None

    @staticmethod
    def get_default_config() -> MyASRConfig:
        """Get the default configuration."""
        return MyASRConfig(
            endpoint="wss://api.example.com/v1/speech",
            language="en-US",
        )

    async def send_keep_alive(self) -> None:
        """Send a keep-alive message if supported by your service."""
        if self.asr_socket is not None:
            await self.asr_socket.send(json.dumps({"type": "keep_alive"}))

This structure allows you to integrate any speech recognition service with Rasa's voice capabilities while maintaining compatibility with the existing voice stream infrastructure.

Custom TTS

You can implement your own custom TTS component as a Python class to integrate with any third-party text-to-speech service. A custom TTS component must subclass the TTSEngine class from rasa.core.channels.voice_stream.tts.tts_engine.

Your custom TTS component must output audio in the RasaAudioBytes format; use the engine_bytes_to_rasa_audio_bytes method to convert your engine's audio into that format.

Required Methods

Your custom TTS component must implement the following methods:

  • synthesize(text: str, config: Optional[T]): Generate speech from text, returning an async iterator of RasaAudioBytes chunks
  • engine_bytes_to_rasa_audio_bytes(chunk: bytes): Convert your engine's audio format to Rasa's audio format
  • from_config_dict(config: Dict): Class method to create an instance from configuration dictionary
  • get_default_config(): Static method that returns the default configuration for your component

Optional Methods

You may also override these methods as needed:

  • close_connection(): Custom cleanup when closing connections (e.g., closing websockets or HTTP sessions). The default implementation does nothing.

See Configuration for details on how to configure your custom TTS component.

Example Implementation

Here's an example based on the Deepgram TTS implementation structure:

custom_tts.py
import os
from dataclasses import dataclass
from typing import AsyncIterator, Dict, Optional
from urllib.parse import urlencode

import aiohttp
from aiohttp import ClientTimeout, WSMsgType

from rasa.core.channels.voice_stream.audio_bytes import RasaAudioBytes
from rasa.core.channels.voice_stream.tts.tts_engine import (
    TTSEngine,
    TTSEngineConfig,
    TTSError,
)


@dataclass
class MyTTSConfig(TTSEngineConfig):
    api_key: str = ""
    endpoint: str = "wss://api.example.com/v1/speak"
    model_id: str = "en-US-standard"


class MyTTS(TTSEngine[MyTTSConfig]):
    session: Optional[aiohttp.ClientSession] = None
    required_env_vars = ("MY_TTS_API_KEY",)  # Optional: required environment variables
    required_packages = ("aiohttp",)  # Optional: required Python packages
    ws: Optional[aiohttp.ClientWebSocketResponse] = None

    def __init__(self, config: Optional[MyTTSConfig] = None):
        super().__init__(config)
        # Request timeout built from the configuration; the aiohttp session
        # that uses it is created lazily in synthesize()
        self.timeout = ClientTimeout(total=self.config.timeout)

    @staticmethod
    def get_request_headers(config: MyTTSConfig) -> dict[str, str]:
        """Build request headers with authentication."""
        api_key = os.environ["MY_TTS_API_KEY"]
        return {
            "Authorization": f"Bearer {api_key}",
        }

    async def close_connection(self) -> None:
        """Close the WebSocket connection if it exists."""
        if self.ws and not self.ws.closed:
            await self.ws.close()
        self.ws = None

    def get_websocket_url(self, config: MyTTSConfig) -> str:
        """Build the WebSocket URL with query parameters."""
        base_url = config.endpoint
        query_params = {
            "model": config.model_id,
            "language": config.language,
            "voice": config.voice,
            "encoding": "mulaw",
            "sample_rate": "8000",
        }
        return f"{base_url}?{urlencode(query_params)}"

    async def synthesize(
        self, text: str, config: Optional[MyTTSConfig] = None
    ) -> AsyncIterator[RasaAudioBytes]:
        """Generate speech from text using a WebSocket TTS API."""
        config = self.config.merge(config)
        headers = self.get_request_headers(config)
        ws_url = self.get_websocket_url(config)

        try:
            # Create the HTTP session lazily on first use
            if self.session is None or self.session.closed:
                self.session = aiohttp.ClientSession(timeout=self.timeout)

            self.ws = await self.session.ws_connect(
                ws_url,
                headers=headers,
                timeout=float(self.config.timeout),
            )

            # Send the text to synthesize
            await self.ws.send_json({
                "type": "Speak",
                "text": text,
            })

            # Signal that we're done sending text
            await self.ws.send_json({"type": "Flush"})

            # Stream audio chunks
            async for msg in self.ws:
                if msg.type == WSMsgType.BINARY:
                    # Binary data is the raw audio
                    yield self.engine_bytes_to_rasa_audio_bytes(msg.data)
                elif msg.type == WSMsgType.TEXT:
                    # Check if the stream is complete
                    if "Close" in msg.data or "Flushed" in msg.data:
                        break
                elif msg.type in (WSMsgType.CLOSED, WSMsgType.ERROR):
                    break

            # Send a close message
            if self.ws and not self.ws.closed:
                await self.ws.send_json({"type": "Close"})
        except Exception as e:
            raise TTSError(f"Error during TTS synthesis: {e}")
        finally:
            # Ensure the connection is closed
            await self.close_connection()

    def engine_bytes_to_rasa_audio_bytes(self, chunk: bytes) -> RasaAudioBytes:
        """Convert the generated TTS audio bytes into Rasa audio bytes."""
        # If your service already returns mulaw-encoded audio, return it directly.
        # If it returns a different format (e.g. 16-bit linear PCM), convert it
        # first, for example with: audioop.lin2ulaw(chunk, 2)  # 2 = 16-bit samples
        return RasaAudioBytes(chunk)

    @staticmethod
    def get_default_config() -> MyTTSConfig:
        """Get the default configuration."""
        return MyTTSConfig(
            endpoint="wss://api.example.com/v1/speak",
            model_id="en-US-standard",
            language="en-US",
            voice="female-1",
            timeout=30,
        )

    @classmethod
    def from_config_dict(cls, config: Dict) -> "MyTTS":
        """Create an instance from a configuration dictionary."""
        return cls(MyTTSConfig.from_dict(config))

Configuration for Custom Components

To use a custom ASR or TTS component, you need to supply credentials for it in your credentials.yml file. The configuration should contain the module path of your custom class and any required configuration parameters.

The module path follows the format path.to.module.ClassName. For example:

  • A class MyASR in file addons/custom_asr.py has module path addons.custom_asr.MyASR
  • A class MyTTS in file addons/custom_tts.py has module path addons.custom_tts.MyTTS

Custom ASR Configuration Example

credentials.yml
browser_audio:
  # ... other configuration
  asr:
    name: addons.custom_asr.MyASR
    api_key: "your_api_key"
    endpoint: "wss://api.example.com/v1/speech"
    language: "en-US"
    # any other custom parameters your ASR needs

Custom TTS Configuration Example

credentials.yml
browser_audio:
  # ... other configuration
  tts:
    name: addons.custom_tts.MyTTS
    api_key: "your_api_key"
    endpoint: "wss://api.example.com/v1/speak"
    language: "en-US"
    voice: "en-US-JennyNeural"
    timeout: 30
    # any other custom parameters your TTS needs

Any custom parameters you define in your configuration class (e.g., MyASRConfig or MyTTSConfig) can be passed through the credentials file and will be available in your component via self.config.