Speech Integrations
Audio Format
Rasa uses a common intermediate audio format called RasaAudioBytes, which serves as the standard data format exchanged between channels, ASR engines, and TTS engines and avoids pairwise format conversions between them. Currently, this corresponds to:
- Raw wave format
- 8kHz sample rate
- 8-bit depth
- Mono channel
- μ-law (mulaw) encoding
These parameters are not configurable. Rasa uses the library audioop-lts for
conversion between audio encodings (functions like ulaw2lin() or lin2ulaw()).
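For illustration, here is a minimal sketch of the kind of conversion audioop-lts performs, turning 16-bit linear PCM into the RasaAudioBytes parameters listed above (8 kHz, mono, μ-law) and back. The helper names and the 16 kHz input rate are assumptions made for this example, not part of Rasa's API.

import audioop  # provided by the audioop-lts package on newer Python versions


def pcm16_to_rasa_audio_bytes(pcm: bytes, input_rate: int = 16000) -> bytes:
    """Convert 16-bit linear PCM mono audio into 8 kHz mu-law bytes."""
    # Resample to 8 kHz; width=2 bytes per sample, 1 channel, no carry-over state.
    resampled, _ = audioop.ratecv(pcm, 2, 1, input_rate, 8000, None)
    # Encode the 16-bit linear samples as 8-bit mu-law.
    return audioop.lin2ulaw(resampled, 2)


def rasa_audio_bytes_to_pcm16(mulaw: bytes) -> bytes:
    """Decode 8-bit mu-law bytes back into 16-bit linear PCM."""
    return audioop.ulaw2lin(mulaw, 2)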
Automatic Speech Recognition (ASR)
This section describes the supported integrations with Automatic Speech Recognition (ASR) or Speech To Text (STT) services.
Deepgram
Set your Deepgram API key with the environment variable DEEPGRAM_API_KEY. You can
request a key from Deepgram. The ASR engine can be configured in a Voice Stream channel
as follows:
browser_audio:
  # ... other configuration
  asr:
    name: deepgram
Deepgram uses two mechanisms to detect when a speaker has finished talking:
- Endpointing: Uses Voice Activity Detection (VAD) to detect silence after speech
- UtteranceEnd: Looks at word timings to detect gaps between words
The configuration parameters endpointing and utterance_end_ms below control these two mechanisms respectively. For noisy environments, utterance_end_ms may be more reliable because it ignores non-speech audio. Read more in the Deepgram documentation.
Configuration parameters
- endpoint: Optional, defaults to api.deepgram.com - The endpoint URL for the Deepgram API.
- endpointing: Optional, defaults to 400 - Number of milliseconds of silence used to determine the end of speech.
- language: Optional, defaults to en - The language code for the speech recognition.
- model: Optional, defaults to nova-2-general - The model to be used for speech recognition.
- smart_format: Optional, defaults to true - Boolean value to enable or disable Deepgram's smart formatting.
- utterance_end_ms: Optional, defaults to 1000 - Time in milliseconds to wait before considering an utterance complete.
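For reference, a channel configuration that spells out all of these parameters with their documented defaults could look like the following:

browser_audio:
  # ... other configuration
  asr:
    name: deepgram
    endpoint: api.deepgram.com
    endpointing: 400
    language: en
    model: nova-2-general
    smart_format: true
    utterance_end_ms: 1000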
Azure
Requires the Python library azure-cognitiveservices-speech. The API key can be set with the environment variable AZURE_SPEECH_API_KEY.
A sample configuration looks as follows:
browser_audio:
  # ... other configuration
  asr:
    name: azure
Configuration parameters
- language: Required. The language code for the speech recognition. (See the Azure documentation for a list of languages.)
- speech_region: Optional, defaults to None - The region identifier for the Azure Speech service, such as westus. Ensure that the region matches the region of your subscription.
- speech_endpoint: Optional, defaults to None - The service endpoint to connect to. You can use it when the Azure Speech service sits behind a reverse proxy.
- speech_host: Optional, defaults to None - The service host to connect to. The standard resource path is assumed. The format is "protocol://host:port", where ":port" is optional.
Although speech_region, speech_endpoint, and speech_host are all optional, they cannot all be
empty at the same time. If all three are unset, speech_region falls back to eastus.
When connecting to the Azure cloud, the speech_region parameter alone is enough. Here is an example config:
browser_audio:
  server_url: localhost
  asr:
    name: azure
    language: de-DE
    speech_region: germanywestcentral
  tts:
    name: azure
    language: de-DE
    voice: de-DE-KatjaNeural
    speech_region: germanywestcentral
Others
Looking for integration with a different ASR service? You can create your own custom ASR component.
Text To Speech (TTS)
This section describes the supported integrations with Text To Speech (TTS) services.
Azure TTS
The API key can be set with the environment variable AZURE_SPEECH_API_KEY. A sample configuration looks as follows:
browser_audio:
  # ... other configuration
  tts:
    name: azure
Configuration parameters
- language: Optional, defaults to en-US - The language code for the text-to-speech conversion. (See the Azure documentation for a list of languages and voices.)
- voice: Optional, defaults to en-US-JennyNeural - The voice to be used for the text-to-speech conversion. The voice defines specific characteristics such as the speaker's gender, age, and speaking style.
- timeout: Optional, defaults to 10 - The timeout duration in seconds for the text-to-speech request.
- speech_region: Optional, defaults to None - The region identifier for the Azure Speech service. Ensure that the region matches the region of your subscription.
- endpoint: Optional, defaults to None - The service endpoint for the Azure Speech service.
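As an illustration, an Azure TTS block that sets these parameters explicitly might look like this; the westus region is only an example and must match your subscription:

browser_audio:
  # ... other configuration
  tts:
    name: azure
    language: en-US
    voice: en-US-JennyNeural
    timeout: 10
    speech_region: westus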
Cartesia TTS
Set your Cartesia API key with the environment variable CARTESIA_API_KEY. Obtaining a key
requires a Cartesia account. The TTS engine can be configured in a Voice Stream channel as follows:
browser_audio:
  # ... other configuration
  tts:
    name: cartesia
Configuration parameters
- language: Optional, defaults to en - The language code for the text-to-speech conversion.
- voice: Optional, defaults to 248be419-c632-4f23-adf1-5324ed7dbf1d - The id of the voice to use for text-to-speech conversion. The parameter is passed to the Cartesia API as "voice": {"mode": "id", "id": "VALUE"}.
- timeout: Optional, defaults to 10 - The timeout duration in seconds for the text-to-speech request.
- model_id: Optional, defaults to sonic-english - The model ID to be used for the text-to-speech conversion.
- version: Optional, defaults to 2024-06-10 - The version of the model to be used for the text-to-speech conversion.
- endpoint: Optional, defaults to https://api.cartesia.ai/tts/sse - The endpoint URL for the Cartesia API.
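For example, a Cartesia TTS block that states these documented defaults explicitly could look as follows:

browser_audio:
  # ... other configuration
  tts:
    name: cartesia
    language: en
    voice: 248be419-c632-4f23-adf1-5324ed7dbf1d
    model_id: sonic-english
    version: "2024-06-10"
    timeout: 10
    endpoint: https://api.cartesia.ai/tts/sse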
Deepgram TTS
Set your Deepgram API key with the environment variable DEEPGRAM_API_KEY. You can
request a key from Deepgram. The TTS engine can be configured in a Voice Stream channel
as follows:
browser_audio:
  # ... other configuration
  tts:
    name: deepgram
Configuration parameters
Deepgram does not use the parent-class parameters language and voice, because each
model is uniquely identified using the format [modelname]-[voicename]-[language].
- model_id: Optional, defaults to aura-2-andromeda-en - The list of available options can be found in the Deepgram documentation.
- endpoint: Optional, defaults to wss://api.deepgram.com/v1/speak - The endpoint URL for the Deepgram API.
- timeout: Optional, defaults to 30 - The timeout duration in seconds for the text-to-speech request.
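For example, the default model_id aura-2-andromeda-en combines the aura-2 model, the andromeda voice, and English (en). A configuration that sets the documented parameters explicitly might look like this:

browser_audio:
  # ... other configuration
  tts:
    name: deepgram
    model_id: aura-2-andromeda-en
    endpoint: wss://api.deepgram.com/v1/speak
    timeout: 30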
Rime TTS
Set your Rime API key with the environment variable RIME_API_KEY. You can
request a key from Rime. The TTS engine can be configured in a Voice Stream channel
as follows:
browser_audio:
  # ... other configuration
  tts:
    name: rime
Configuration parameters
- speaker: Optional, defaults to cove - The speaker voice to use for text-to-speech conversion.
- model_id: Optional, defaults to mistv2 - The model ID to be used for text-to-speech conversion.
- endpoint: Optional, defaults to wss://users.rime.ai/ws2 - The endpoint URL for the Rime API.
- speed_alpha: Optional, defaults to 1.0 - Controls the speed of speech synthesis.
- segment: Optional, defaults to immediate - Segment mode for synthesis. Use "immediate" for low latency.
- timeout: Optional, defaults to 30 - The timeout duration in seconds for the text-to-speech request.
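As a sketch, a Rime TTS block with these documented defaults written out could look as follows:

browser_audio:
  # ... other configuration
  tts:
    name: rime
    speaker: cove
    model_id: mistv2
    endpoint: wss://users.rime.ai/ws2
    speed_alpha: 1.0
    segment: immediate
    timeout: 30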
Others
Looking for integration with a different TTS service? You can create your own custom TTS component.
Custom ASR
You can implement your own custom ASR component as a Python class to
integrate with any third-party speech recognition service. A custom ASR
component must subclass the ASREngine class from rasa.core.channels.voice_stream.asr.asr_engine.
Your custom ASR component will receive audio in the RasaAudioBytes format and may need to convert it to your service's expected format.
Required Methods
Your custom ASR component must implement the following methods:
- open_websocket_connection(): Establish a websocket connection to your ASR service.
- from_config_dict(config: Dict): Class method to create an instance from a configuration dictionary.
- signal_audio_done(): Signal to the ASR service that audio input has ended.
- rasa_audio_bytes_to_engine_bytes(chunk: RasaAudioBytes): Convert the Rasa audio format to your engine's expected format.
- engine_event_to_asr_event(event: Any): Convert your engine's events to Rasa's ASREvent format.
- get_default_config(): Static method that returns the default configuration for your component.
Optional Methods
You may also override these methods as needed:
- send_keep_alive(): Send keep-alive messages to maintain the connection. The default implementation is only a pass statement.
- close_connection(): Custom cleanup when closing the connection. The default implementation is as follows:

async def close_connection(self) -> None:
    if self.asr_socket:
        await self.asr_socket.close()
ASR Events
Your engine_event_to_asr_event method should return appropriate ASREvent objects:
- UserIsSpeaking(transcript): For interim/partial transcripts while the user is speaking.
- NewTranscript(transcript): For final transcripts when the user has finished speaking.
See Configuration for details on how to configure your custom ASR component.
Example Implementation
Here's an example based on the Deepgram implementation structure:
import json
import os
from dataclasses import dataclass
from typing import Any, Dict, Optional
from urllib.parse import urlencode

import websockets
from websockets.legacy.client import WebSocketClientProtocol

from rasa.core.channels.voice_stream.asr.asr_engine import ASREngine, ASREngineConfig
from rasa.core.channels.voice_stream.asr.asr_event import (
    ASREvent,
    NewTranscript,
    UserIsSpeaking,
)
from rasa.core.channels.voice_stream.audio_bytes import RasaAudioBytes


@dataclass
class MyASRConfig(ASREngineConfig):
    api_key: str = ""
    endpoint: str = "wss://api.example.com/v1/speech"
    language: str = "en-US"


class MyASR(ASREngine[MyASRConfig]):
    required_env_vars = ("MY_ASR_API_KEY",)  # Optional: required environment variables
    required_packages = ("my_asr_package",)  # Optional: required Python packages

    def __init__(self, config: Optional[MyASRConfig] = None):
        super().__init__(config)
        self.accumulated_transcript = ""

    async def open_websocket_connection(self) -> WebSocketClientProtocol:
        """Connect to the ASR system."""
        api_key = os.environ["MY_ASR_API_KEY"]
        headers = {"Authorization": f"Bearer {api_key}"}
        return await websockets.connect(
            self._get_api_url_with_params(),
            extra_headers=headers,
        )

    def _get_api_url_with_params(self) -> str:
        """Build the API URL with query parameters."""
        query_params = {
            "language": self.config.language,
            "encoding": "mulaw",
            "sample_rate": "8000",
            "interim_results": "true",
        }
        return f"{self.config.endpoint}?{urlencode(query_params)}"

    @classmethod
    def from_config_dict(cls, config: Dict) -> "MyASR":
        """Create an instance from a configuration dictionary."""
        asr_config = MyASRConfig.from_dict(config)
        return cls(asr_config)

    async def signal_audio_done(self) -> None:
        """Signal to the ASR service that audio input has ended."""
        await self.asr_socket.send(json.dumps({"type": "stop_audio"}))

    def rasa_audio_bytes_to_engine_bytes(self, chunk: RasaAudioBytes) -> bytes:
        """Convert the Rasa audio format to the engine format."""
        # For most services you can return the chunk directly,
        # since it is already mu-law encoded.
        return chunk

    def engine_event_to_asr_event(self, event: Any) -> Optional[ASREvent]:
        """Convert an engine response to an ASREvent."""
        data = json.loads(event)
        if data.get("type") == "transcript":
            transcript = data.get("text", "")
            if data.get("is_final"):
                # Final transcript - the user finished speaking
                full_transcript = self.accumulated_transcript + " " + transcript
                self.accumulated_transcript = ""
                return NewTranscript(full_transcript.strip())
            elif transcript:
                # Interim transcript - the user is still speaking
                return UserIsSpeaking(transcript)
        return None

    @staticmethod
    def get_default_config() -> MyASRConfig:
        """Get the default configuration."""
        return MyASRConfig(
            endpoint="wss://api.example.com/v1/speech",
            language="en-US",
        )

    async def send_keep_alive(self) -> None:
        """Send a keep-alive message if supported by your service."""
        if self.asr_socket is not None:
            await self.asr_socket.send(json.dumps({"type": "keep_alive"}))
This structure allows you to integrate any speech recognition service with Rasa's voice capabilities while maintaining compatibility with the existing voice stream infrastructure.
Custom TTS
You can implement your own custom TTS component as a Python class to
integrate with any third-party text-to-speech service. A custom TTS component
must subclass the TTSEngine class from
rasa.core.channels.voice_stream.tts.tts_engine.
Your custom TTS component must output audio in the RasaAudioBytes format
and convert it using the engine_bytes_to_rasa_audio_bytes method.
Required Methods
Your custom TTS component must implement the following methods:
- synthesize(text: str, config: Optional[T]): Generate speech from text, returning an async iterator of RasaAudioBytes chunks.
- engine_bytes_to_rasa_audio_bytes(chunk: bytes): Convert your engine's audio format to Rasa's audio format.
- from_config_dict(config: Dict): Class method to create an instance from a configuration dictionary.
- get_default_config(): Static method that returns the default configuration for your component.
Optional Methods
You may also override these methods as needed:
- connect(config: Optional[T]): Establish a connection to the TTS engine if necessary. The default implementation does nothing.
- close_connection(): Custom cleanup when closing connections (e.g., closing websockets or HTTP sessions). The default implementation does nothing.
- send_text_chunk(text: str): Send text chunks to the TTS system for streaming synthesis. Used with stream_audio() for real-time streaming.
- signal_text_done(): Signal the TTS engine to process any remaining buffered text and prepare to end the stream.
- stream_audio(): Stream audio output from the TTS engine. Continuously yields audio chunks as they are produced by the engine.
Streaming vs Non-Streaming TTS
The TTSEngine supports both streaming and non-streaming modes:
- Non-streaming: Implement only the synthesize() method for simple request-response synthesis.
- Streaming: Set streaming_input = True and implement the send_text_chunk(), signal_text_done(), and stream_audio() methods for real-time streaming synthesis.
Class Attributes
You can optionally define these class attributes:
- required_env_vars: Tuple of required environment variable names.
- required_packages: Tuple of required Python package names.
- streaming_input: Boolean indicating whether the engine supports streaming input (defaults to False).
See Configuration for details on how to configure your custom TTS component.
Example Implementation
Here's an example of an implementation for TTS streaming:
import base64
import os
from dataclasses import dataclass
from typing import AsyncIterator, Dict, Optional
from urllib.parse import urlencode

import aiohttp
from aiohttp import ClientTimeout, WSMsgType

from rasa.core.channels.voice_stream.audio_bytes import RasaAudioBytes
from rasa.core.channels.voice_stream.tts.tts_engine import (
    TTSEngine,
    TTSEngineConfig,
    TTSError,
)


@dataclass
class MyTTSConfig(TTSEngineConfig):
    endpoint: str = "wss://api.example.com/v1/speak"
    model_id: str = "en-US-standard"


class MyTTS(TTSEngine[MyTTSConfig]):
    session: Optional[aiohttp.ClientSession] = None
    required_env_vars = ("MY_TTS_API_KEY",)  # Optional: required environment variables
    required_packages = ("aiohttp",)  # Optional: required Python packages
    streaming_input = True
    ws: Optional[aiohttp.ClientWebSocketResponse] = None

    def __init__(self, config: Optional[MyTTSConfig] = None):
        super().__init__(config)
        timeout = ClientTimeout(total=self.config.timeout)
        # All class instances share the same session
        if self.__class__.session is None or self.__class__.session.closed:
            self.__class__.session = aiohttp.ClientSession(timeout=timeout)

    async def connect(self, config: Optional[MyTTSConfig] = None) -> None:
        """Establish a WebSocket connection to the TTS engine."""
        # Use the per-call config if provided, otherwise fall back to the engine's config.
        merged_config = config or self.config
        headers = {
            "Authorization": f"Bearer {os.environ['MY_TTS_API_KEY']}",
        }
        query_params = {
            "model": merged_config.model_id,
            "language": merged_config.language,
            "voice": merged_config.voice,
            "encoding": "mulaw",
            "sample_rate": "8000",
        }
        ws_url = f"{merged_config.endpoint}?{urlencode(query_params)}"
        try:
            self.ws = await self.session.ws_connect(
                ws_url,
                headers=headers,
                timeout=float(self.config.timeout) if self.config.timeout else 30,
            )
        except Exception as e:
            raise TTSError(f"Failed to connect to TTS service: {e}")

    async def close_connection(self) -> None:
        """Close the WebSocket connection if it exists."""
        if self.ws and not self.ws.closed:
            await self.ws.close()
        self.ws = None

    async def send_text_chunk(self, text: str) -> None:
        """Send a text chunk to the TTS engine for streaming synthesis."""
        await self.ws.send_json({"text": text})

    async def signal_text_done(self) -> None:
        """Signal the TTS engine that all text has been sent."""
        await self.ws.send_json({"operation": "flush"})

    async def stream_audio(self) -> AsyncIterator[RasaAudioBytes]:
        """Stream audio output from the TTS engine."""
        try:
            async for msg in self.ws:
                if msg.type == WSMsgType.TEXT:
                    data = msg.json()
                    if data.get("type") == "audio":
                        # Assume base64-encoded audio data
                        audio_bytes = base64.b64decode(data.get("data", ""))
                        if audio_bytes:
                            yield self.engine_bytes_to_rasa_audio_bytes(audio_bytes)
                    elif data.get("type") == "error":
                        raise TTSError(f"TTS error: {data.get('message')}")
                elif msg.type == WSMsgType.CLOSED:
                    break
                elif msg.type == WSMsgType.ERROR:
                    raise TTSError("WebSocket error during audio streaming")
        except Exception as e:
            raise TTSError(f"Error during audio streaming: {e}")

    async def synthesize(
        self, text: str, config: Optional[MyTTSConfig] = None
    ) -> AsyncIterator[RasaAudioBytes]:
        """Generate speech from text using the streaming WebSocket."""
        await self.connect(config)
        try:
            await self.send_text_chunk(text)
            await self.signal_text_done()
            async for audio_chunk in self.stream_audio():
                yield audio_chunk
        finally:
            await self.close_connection()

    def engine_bytes_to_rasa_audio_bytes(self, chunk: bytes) -> RasaAudioBytes:
        """Convert the generated TTS audio bytes into Rasa audio bytes."""
        # If your service already returns mu-law audio, return it directly.
        return RasaAudioBytes(chunk)
        # If your service returns audio in a different format (e.g., linear PCM),
        # you'll need to convert it first. For example, to convert from 16-bit
        # linear PCM:
        # import audioop
        # mulaw_data = audioop.lin2ulaw(chunk, 2)  # 2 = 16-bit samples
        # return RasaAudioBytes(mulaw_data)

    @staticmethod
    def get_default_config() -> MyTTSConfig:
        return MyTTSConfig(
            endpoint="wss://api.example.com/v1/speak",
            model_id="en-US-standard",
            language="en-US",
            voice="female-1",
            timeout=30,
        )

    @classmethod
    def from_config_dict(cls, config: Dict) -> "MyTTS":
        return cls(MyTTSConfig.from_dict(config))
If your TTS service doesn't support continuous input streaming (i.e., you need to send all text at once), set streaming_input = False and implement only the synthesize() method. You can skip implementing the connect(), send_text_chunk(), signal_text_done(), and stream_audio() methods. In that case, synthesize() should handle the entire synthesis process from text to audio, as in the sketch below.
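For comparison, here is a minimal non-streaming sketch. It assumes a hypothetical HTTP service at https://api.example.com/v1/synthesize that accepts the full text in a single POST request and returns 8 kHz mu-law audio in the response body; the class name, config fields, payload keys, and MY_TTS_API_KEY variable are illustrative placeholders, not a real API.

import os
from dataclasses import dataclass
from typing import AsyncIterator, Dict, Optional

import aiohttp

from rasa.core.channels.voice_stream.audio_bytes import RasaAudioBytes
from rasa.core.channels.voice_stream.tts.tts_engine import (
    TTSEngine,
    TTSEngineConfig,
    TTSError,
)


@dataclass
class MySimpleTTSConfig(TTSEngineConfig):
    endpoint: str = "https://api.example.com/v1/synthesize"


class MySimpleTTS(TTSEngine[MySimpleTTSConfig]):
    required_env_vars = ("MY_TTS_API_KEY",)
    streaming_input = False  # all text is sent in one request; no input streaming

    async def synthesize(
        self, text: str, config: Optional[MySimpleTTSConfig] = None
    ) -> AsyncIterator[RasaAudioBytes]:
        """Request the full utterance and yield audio as it is downloaded."""
        conf = config or self.config
        headers = {"Authorization": f"Bearer {os.environ['MY_TTS_API_KEY']}"}
        payload = {"text": text, "encoding": "mulaw", "sample_rate": 8000}
        async with aiohttp.ClientSession() as session:
            async with session.post(conf.endpoint, json=payload, headers=headers) as resp:
                if resp.status != 200:
                    raise TTSError(f"TTS request failed with status {resp.status}")
                # Yield the response body in chunks as RasaAudioBytes.
                async for chunk in resp.content.iter_chunked(4096):
                    yield self.engine_bytes_to_rasa_audio_bytes(chunk)

    def engine_bytes_to_rasa_audio_bytes(self, chunk: bytes) -> RasaAudioBytes:
        # The service is assumed to return 8 kHz mu-law audio already.
        return RasaAudioBytes(chunk)

    @staticmethod
    def get_default_config() -> MySimpleTTSConfig:
        return MySimpleTTSConfig(endpoint="https://api.example.com/v1/synthesize")

    @classmethod
    def from_config_dict(cls, config: Dict) -> "MySimpleTTS":
        return cls(MySimpleTTSConfig.from_dict(config))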
Configuration for Custom Components
To use a custom ASR or TTS component, you need to supply credentials for it in your credentials.yml file. The configuration should contain the module path of your custom class and any required configuration parameters.
The module path follows the format path.to.module.ClassName. For example:
- A class MyASR in the file addons/custom_asr.py has the module path addons.custom_asr.MyASR
- A class MyTTS in the file addons/custom_tts.py has the module path addons.custom_tts.MyTTS
Custom ASR Configuration Example
browser_audio:
  # ... other configuration
  asr:
    name: addons.custom_asr.MyASR
    api_key: "your_api_key"
    endpoint: "wss://api.example.com/v1/speech"
    language: "en-US"
    # any other custom parameters your ASR needs
Custom TTS Configuration Example
browser_audio:
  # ... other configuration
  tts:
    name: addons.custom_tts.MyTTS
    api_key: "your_api_key"
    endpoint: "wss://api.example.com/v1/speak"
    language: "en-US"
    voice: "en-US-JennyNeural"
    timeout: 30
    # any other custom parameters your TTS needs
Any custom parameters you define in your configuration class (e.g., MyASRConfig or MyTTSConfig) can be passed through the credentials file and will be available in your component via self.config.