Speech Integrations
Automatic Speech Recognition (ASR)
This section describes the supported integrations with Automatic Speech Recognition (ASR) or Speech To Text (STT) services.
Rasa uses mulaw encoding with an 8000 Hz sample rate; these parameters are not configurable.
Deepgram
Use the environment variable DEEPGRAM_API_KEY for the Deepgram API key. You can request a key from Deepgram. It can be configured in a Voice Stream channel as follows:
browser_audio:
  # ... other configuration
  asr:
    name: deepgram
Deepgram uses two mechanisms to detect when a speaker has finished talking:
- Endpointing: Uses Voice Activity Detection (VAD) to detect silence after speech
- UtteranceEnd: Looks at word timings to detect gaps between words
The configuration parameters endpointing and utterance_end_ms below control these two features respectively. For noisy environments, utterance_end_ms may be more reliable because it ignores non-speech audio. Read more in the Deepgram documentation.
Configuration parameters
- endpoint: Optional, defaults to api.deepgram.com. The endpoint URL for the Deepgram API.
- endpointing: Optional, defaults to 400. Number of milliseconds of silence used to determine the end of speech.
- language: Optional, defaults to en. The language code for the speech recognition.
- model: Optional, defaults to nova-2-general. The model to be used for speech recognition.
- smart_format: Optional, defaults to true. Boolean value to enable or disable Deepgram's smart formatting.
- utterance_end_ms: Optional, defaults to 1000. Time in milliseconds to wait before considering an utterance complete.
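As a sketch, a Deepgram ASR configuration with these parameters written out explicitly might look like the following. The values shown are simply the documented defaults, included here for illustration; in practice you would only set the ones you want to change:

```yaml
browser_audio:
  # ... other configuration
  asr:
    name: deepgram
    endpoint: api.deepgram.com
    endpointing: 400          # ms of silence before end of speech
    language: en
    model: nova-2-general
    smart_format: true
    utterance_end_ms: 1000    # ms gap between words before utterance ends
```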
Azure
Requires the Python library azure-cognitiveservices-speech. The API key can be set with the environment variable AZURE_SPEECH_API_KEY. A sample configuration looks as follows:
browser_audio:
  # ... other configuration
  asr:
    name: azure
Configuration parameters
- language: Required. The language code for the speech recognition (see the Azure documentation for a list of languages).
- speech_region: Optional, defaults to None. The region identifier for the Azure Speech service, such as westus. Ensure that the region matches the region of your subscription.
- speech_endpoint: Optional, defaults to None. The service endpoint to connect to. You can use it when you have the Azure Speech service behind a reverse proxy.
- speech_host: Optional, defaults to None. The service host to connect to. The standard resource path will be assumed. The format is "protocol://host:port", where ":port" is optional.
While speech_region, speech_endpoint, and speech_host are all optional parameters, they cannot all be empty at the same time. If none of them is set, speech_region falls back to eastus.
When connecting to Azure Cloud, the speech_region parameter is enough. Here is an example config:
browser_audio:
  server_url: localhost
  asr:
    name: azure
    language: de-DE
    speech_region: germanywestcentral
  tts:
    name: azure
    language: de-DE
    voice: de-DE-KatjaNeural
    speech_region: germanywestcentral
Text To Speech (TTS)
This section describes the supported integrations with Text To Speech (TTS) services.
Azure TTS
The API key can be set with the environment variable AZURE_SPEECH_API_KEY. A sample configuration looks as follows:
browser_audio:
  # ... other configuration
  tts:
    name: azure
Configuration parameters
- language: Optional, defaults to en-US. The language code for the text-to-speech conversion (see the Azure documentation for a list of languages and voices).
- voice: Optional, defaults to en-US-JennyNeural. The voice to be used for the text-to-speech conversion. The voice defines the specific characteristics of the speech, such as the speaker's gender, age, and speaking style.
- timeout: Optional, defaults to 10. The timeout duration in seconds for the text-to-speech request.
- speech_region: Optional, defaults to None. The region identifier for the Azure Speech service. Ensure that the region matches the region of your subscription.
- endpoint: Optional, defaults to None. The service endpoint for the Azure Speech service.
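As an illustration, an Azure TTS configuration with these parameters made explicit could look like the following. The language, voice, and timeout values are the documented defaults; the westus region is only an example and must match your own subscription:

```yaml
browser_audio:
  # ... other configuration
  tts:
    name: azure
    language: en-US
    voice: en-US-JennyNeural
    timeout: 10               # seconds before the TTS request times out
    speech_region: westus     # example region; use your subscription's region
```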
Cartesia TTS
Use the environment variable CARTESIA_API_KEY
for Cartesia API Key. The API Key
requires a Cartesia account. It can be configured in a Voice Stream channel as follows,
browser_audio:
  # ... other configuration
  tts:
    name: cartesia
Configuration parameters
- language: Optional, defaults to en. The language code for the text-to-speech conversion.
- voice: Optional, defaults to 248be419-c632-4f23-adf1-5324ed7dbf1d. The id of the voice to use for the text-to-speech conversion. The parameter will be passed to the Cartesia API as "voice": {"mode": "id", "id": "VALUE"}.
- timeout: Optional, defaults to 10. The timeout duration in seconds for the text-to-speech request.
- model_id: Optional, defaults to sonic-english. The model ID to be used for the text-to-speech conversion.
- version: Optional, defaults to 2024-06-10. The version of the model to be used for the text-to-speech conversion.
- endpoint: Optional, defaults to https://api.cartesia.ai/tts/sse. The endpoint URL for the Cartesia API.
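Putting the parameters above together, a fully spelled-out Cartesia TTS configuration might look like the following sketch. Every value shown is just the documented default, repeated here for illustration; the version string is quoted so YAML does not parse it as a date:

```yaml
browser_audio:
  # ... other configuration
  tts:
    name: cartesia
    language: en
    voice: 248be419-c632-4f23-adf1-5324ed7dbf1d   # default voice id
    timeout: 10                                   # seconds
    model_id: sonic-english
    version: "2024-06-10"                         # quoted to stay a string
    endpoint: https://api.cartesia.ai/tts/sse
```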