notice
This is unreleased documentation for Rasa Documentation Main/Unreleased version.
For the latest released documentation, see the latest version (3.x).
Language Support
You can use Rasa to build assistants in any language you want.
A Rasa assistant can be trained on data in any language. If there are no word embeddings for your language, you can train your featurizers from scratch with the data you provide.
Rasa also supports pre-trained word embeddings, such as those provided by spaCy. For information on which pipeline is best for your use case, see the guide on choosing a pipeline.
Training a Model in Any Language
The following pipeline can be used to train models in whitespace-tokenizable languages:
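As a minimal sketch, a pipeline for a whitespace-tokenizable language might look like the configuration below. The language code fr, the epoch counts, and the character n-gram settings are illustrative assumptions rather than recommended values; adjust them to your data.

```yaml
# config.yml (sketch; values marked "assumed" are illustrative)
recipe: default.v1        # recipe key used by Rasa 3.x configurations
language: "fr"            # assumed: replace with your two-letter language code

pipeline:
  # Split text on whitespace; suitable only for languages that separate words with spaces
  - name: WhitespaceTokenizer
  # Sparse features from regexes and lookup tables in the training data
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  # Word-level bag-of-words features
  - name: CountVectorsFeaturizer
  # Character n-gram features, which help with typos and unseen word forms
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1          # assumed
    max_ngram: 4          # assumed
  # Joint intent classification and entity extraction, trained from scratch
  - name: DIETClassifier
    epochs: 100           # assumed
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100           # assumed
```

None of these components depends on pre-trained embeddings, so every feature is learned from the training data you provide.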
To train a Rasa model in your preferred language, define the pipeline in your config.yml. After you define the pipeline and generate some NLU training data in your chosen language, train the model by running the command rasa train nlu.
Once the training is finished, you can test your model's language skills. To see how your model interprets different input messages, run the command rasa shell nlu.
note
More training data generally leads to a better model, especially when you train word embeddings from scratch. If your model has trouble telling your inputs apart, try training with more example sentences.
Using Pre-trained Language Models
If pre-trained word vectors are available for your language, they are a great way to get started with less data, because the vectors are trained on large corpora such as Wikipedia.
spaCy
With the pre-trained spaCy embeddings, you can use spaCy's pre-trained language models or load fastText vectors, which are available for hundreds of languages. If you want to incorporate a custom model you've found into spaCy, check out their page on adding languages. As described in the documentation, you need to register your language model and link it to the language identifier. Rasa can then load and use your new language when you pass that identifier as the language option.
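A spaCy-based configuration could be sketched as follows. The language code and the model name fr_core_news_md are assumptions chosen for illustration; use the identifier and spaCy model package that are actually installed in your environment. Recent Rasa versions expect the spaCy model name to be given explicitly on the SpacyNLP component, which is why both options appear here.

```yaml
# config.yml (sketch; language code and model name are assumed)
recipe: default.v1
language: "fr"                  # assumed: your language identifier

pipeline:
  # Loads the spaCy language model that the following components rely on
  - name: SpacyNLP
    model: "fr_core_news_md"    # assumed: any installed spaCy model with word vectors
  - name: SpacyTokenizer
  # Dense features taken from the spaCy word vectors
  - name: SpacyFeaturizer
  - name: DIETClassifier
    epochs: 100                 # assumed
```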
MITIE
You can also pre-train your own word vectors from a language corpus using MITIE. To do so:
- Get a clean language corpus (a Wikipedia dump works) as a set of text files.
- Build and run the MITIE Wordrep Tool on your corpus. This can take several hours or days depending on your dataset and your workstation. You'll need something like 128 GB of RAM for wordrep to run. Yes, that's a lot: try to extend your swap.
- Set the path of your new total_word_feature_extractor.dat as the model parameter in your configuration.
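The last step might look like the sketch below. The file location data/total_word_feature_extractor.dat and the language code are assumptions; point model at wherever wordrep wrote its output. The tokenizer is also an assumption: MitieTokenizer suits languages that MITIE can tokenize directly, while other languages may need a different tokenizer.

```yaml
# config.yml (sketch; path, language code, and tokenizer are assumed)
recipe: default.v1
language: "de"                  # assumed: your language code

pipeline:
  # Loads the feature extractor produced by the MITIE Wordrep Tool
  - name: MitieNLP
    model: "data/total_word_feature_extractor.dat"   # assumed location
  - name: MitieTokenizer
  # Dense features computed from the MITIE word vectors
  - name: MitieFeaturizer
  - name: DIETClassifier
    epochs: 100                 # assumed
```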
For a full example of how to train MITIE word vectors, check out this blog post on creating a MITIE model from a Chinese Wikipedia dump.