Speech Integrations

Automatic Speech Recognition (ASR)

This section describes the supported integrations with Automatic Speech Recognition (ASR), also known as Speech To Text (STT), services. Rasa uses mulaw encoding with an 8000 Hz sample rate; these parameters are not configurable.

Deepgram

Set the environment variable DEEPGRAM_API_KEY to your Deepgram API key. You can request a key from Deepgram. The service can be configured in a Voice Stream channel as follows:

credentials.yml
browser_audio:
  # ... other configuration
  asr:
    name: deepgram
Turn Detection

Deepgram uses two mechanisms to detect when a speaker has finished talking:

  1. Endpointing: Uses Voice Activity Detection (VAD) to detect silence after speech
  2. UtteranceEnd: Looks at word timings to detect gaps between words

The configuration parameters endpointing and utterance_end_ms below control these two mechanisms respectively. In noisy environments, utterance_end_ms may be more reliable because it ignores non-speech audio. Read more in the Deepgram documentation.

Configuration parameters

  • endpoint: Optional, defaults to api.deepgram.com - The endpoint URL for the Deepgram API.
  • endpointing: Optional, defaults to 400 - Number of milliseconds of silence to determine the end of speech.
  • language: Optional, defaults to en - The language code for the speech recognition.
  • model: Optional, defaults to nova-2-general - The model to be used for speech recognition.
  • smart_format: Optional, defaults to true - Boolean value to enable or disable Deepgram's smart formatting.
  • utterance_end_ms: Optional, defaults to 1000 - Time in milliseconds to wait before considering an utterance complete.
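Putting these parameters together, a Deepgram configuration tuned for a noisier environment might look like the following sketch. The specific values for endpointing and utterance_end_ms are illustrative assumptions, not recommendations; the other values are the documented defaults.

```yaml
browser_audio:
  # ... other configuration
  asr:
    name: deepgram
    language: en
    model: nova-2-general
    smart_format: true
    # Wait longer before treating silence as end of speech
    endpointing: 600
    # Rely more on word-timing gaps, which ignore non-speech noise
    utterance_end_ms: 1500
```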

Azure

Requires the Python library azure-cognitiveservices-speech. The API key can be set with the environment variable AZURE_SPEECH_API_KEY. A sample configuration looks as follows:

credentials.yml
browser_audio:
  # ... other configuration
  asr:
    name: azure

Configuration parameters

  • language: Required. The language code for the speech recognition. (See Azure documentation for a list of languages).
  • speech_region: Optional, defaults to None - The region identifier for the Azure Speech service, such as westus. Ensure that the region matches the region of your subscription.
  • speech_endpoint: Optional, defaults to None - The service endpoint to connect to. You can use it when you have Azure Speech service behind a reverse proxy.
  • speech_host: Optional, defaults to None - The service host to connect to. Standard resource path will be assumed. Format is "protocol://host:port" where ":port" is optional.

While speech_region, speech_endpoint, and speech_host are individually optional, they cannot all be empty at the same time. If none of them is set, speech_region defaults to eastus.

When connecting to the Azure cloud, the speech_region parameter is sufficient. Here is an example configuration:

browser_audio:
  server_url: localhost
  asr:
    name: azure
    language: de-DE
    speech_region: germanywestcentral
  tts:
    name: azure
    language: de-DE
    voice: de-DE-KatjaNeural
    speech_region: germanywestcentral
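For the reverse-proxy case, speech_endpoint can be set instead of speech_region. A minimal sketch, in which the proxy URL is a hypothetical placeholder you would replace with your own:

```yaml
browser_audio:
  # ... other configuration
  asr:
    name: azure
    language: en-US
    # Hypothetical reverse-proxy URL; replace with your own endpoint
    speech_endpoint: https://speech-proxy.example.com
```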

Text To Speech (TTS)

This section describes the supported integrations with Text To Speech (TTS) services.

Azure TTS

The API key can be set with the environment variable AZURE_SPEECH_API_KEY. A sample configuration looks as follows:

credentials.yml
browser_audio:
  # ... other configuration
  tts:
    name: azure

Configuration parameters

  • language: Optional, defaults to en-US - The language code for the text-to-speech conversion. (See Azure documentation for a list of languages and voices).
  • voice: Optional, defaults to en-US-JennyNeural - The voice to be used for the text-to-speech conversion. Voice defines the specific characteristic of the voice, such as speaker's gender, age and speaking style.
  • timeout: Optional, defaults to 10 - The timeout duration in seconds for the text-to-speech request.
  • speech_region: Optional, defaults to None - The region identifier for the Azure Speech service. Ensure that the region matches the region of your subscription.
  • endpoint: Optional, defaults to None - The service endpoint for Azure Speech service.
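Combining these parameters, an Azure TTS configuration using a German voice might look like the following sketch. The region and voice are illustrative choices taken from the ASR example above, not defaults.

```yaml
browser_audio:
  # ... other configuration
  tts:
    name: azure
    language: de-DE
    voice: de-DE-KatjaNeural
    timeout: 10
    speech_region: germanywestcentral
```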

Cartesia TTS

Set the environment variable CARTESIA_API_KEY to your Cartesia API key, which requires a Cartesia account. The service can be configured in a Voice Stream channel as follows:

credentials.yml
browser_audio:
  # ... other configuration
  tts:
    name: cartesia

Configuration parameters

  • language: Optional, defaults to en - The language code for the text-to-speech conversion.
  • voice: Optional, defaults to 248be419-c632-4f23-adf1-5324ed7dbf1d - The ID of the voice to use for text-to-speech conversion. The parameter is passed to the Cartesia API as "voice": {"mode": "id", "id": "VALUE"}
  • timeout: Optional, defaults to 10 - The timeout duration in seconds for the text-to-speech request.
  • model_id: Optional, defaults to sonic-english - The model ID to be used for the text-to-speech conversion.
  • version: Optional, defaults to 2024-06-10 - The version of the model to be used for the text-to-speech conversion.
  • endpoint: Optional, defaults to https://api.cartesia.ai/tts/sse - The endpoint URL for the Cartesia API.
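As a sketch, a Cartesia TTS configuration that spells out every parameter explicitly might look like this; all values shown are simply the documented defaults, so the block is equivalent to specifying only the name.

```yaml
browser_audio:
  # ... other configuration
  tts:
    name: cartesia
    language: en
    voice: 248be419-c632-4f23-adf1-5324ed7dbf1d
    model_id: sonic-english
    # Quote the version so YAML keeps it as a string, not a date
    version: "2024-06-10"
    timeout: 10
    endpoint: https://api.cartesia.ai/tts/sse
```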