Training your own word embeddings can be a hassle, and we typically advise against it: it can require a lot of computing power, time, and a sizable dataset. Despite this, there are some very good reasons to train your own embeddings, and we've decided to add support for custom embeddings via gensim in rasa-nlu-examples. It won't be relevant for everyone, but we hope it helps those in our community who otherwise might not have access to useful word embeddings.
So, why train your own word embeddings?
No Word Embeddings Available
There are many embeddings available for languages like English, Chinese, German, French, and Spanish, but for many other languages availability varies. Some projects try to address this: FastText offers embeddings for 157 languages, and BytePair embeddings cover 275. That's a lot of languages, but certainly not all of them. With over 7000 languages out there, some users may want to train their own word embeddings simply because none exist for their language yet.
Low Quality of Embeddings
Even when word embeddings do exist for a language, they may not be of high quality. If an embedding is trained exclusively on Wikipedia, you might wonder whether it suits the needs of an AI assistant. Wikipedia articles are typically formal texts describing facts about the past; they don't reflect a conversation happening between two people right now. It's therefore plausible that word embeddings trained on Wikipedia won't fully apply to the AI assistant you're designing, and it might make more sense to train on a corpus of customer support transcripts instead.
Specialized Domain
Even if your assistant is in English and can use high-quality pre-trained embeddings, you might still want to train your own. If your use case is very domain-specific and contains a lot of jargon (for example, legal, medical, or sports-related terms), general word embeddings might not cover enough ground, and training embeddings on your own dataset can help.
Solution
This is why we've added support for gensim in our rasa-nlu-examples project. Gensim is a Python package with many algorithms for natural language processing. One of its many features is the ability to train your own embeddings in just a few lines of code, as the snippet below demonstrates.
from gensim.models import Word2Vec
# Gensim needs a list of lists to represent tokens in a document.
# In real life you'd read a text file and turn each line into a list of tokens.
text = ["this is a sentence", "so is this", "and we're all talking"]
tokens = [t.split(" ") for t in text]
# This is where we train new word embeddings.
model = Word2Vec(sentences=tokens, size=10, window=3,
                 min_count=1, iter=5, workers=2)
# This is where they are saved to disk.
model.wv.save("wordvectors.kv")
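In practice, you would swap the toy sentences for your own corpus. Here's a minimal sketch, assuming a plain-text file (hypothetically named transcripts.txt) with one utterance per line; the keyword arguments mirror the call above and use gensim 3 parameter names (newer gensim versions rename size to vector_size and iter to epochs).

from gensim.models import Word2Vec

# Read a plain-text corpus: one utterance per line.
# The file name is just a placeholder for your own data.
with open("transcripts.txt", encoding="utf-8") as f:
    tokens = [line.strip().lower().split() for line in f if line.strip()]

# Train and save, just like in the snippet above.
model = Word2Vec(sentences=tokens, size=10, window=3,
                 min_count=1, iter=5, workers=2)
model.wv.save("wordvectors.kv")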
In both cases, the trained embeddings end up on disk. The wordvectors.kv file contains the word vectors, which can be reused for whatever use case you'd like (see the short loading sketch after the configuration below). In our case, we've added support for using them in Rasa. To do that, you'll need to install the rasa-nlu-examples project and configure your config.yml to read the file.
language: en
pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: rasa_nlu_examples.featurizers.dense.GensimFeaturizer
  cache_dir: path  # This is the path where the word vectors are saved.
  file: wordvectors.kv  # This is the file we just created.
- name: DIETClassifier
  epochs: 100
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
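With this configuration in place, you train your assistant as usual (for example, with rasa train). And because wordvectors.kv is a standard gensim KeyedVectors file, you can also load it back outside of Rasa to inspect the vectors. Here's a minimal sketch; the word "sentence" is just one of the tokens from the toy corpus above.

from gensim.models import KeyedVectors

# Load the vectors we saved earlier.
wv = KeyedVectors.load("wordvectors.kv")

# Look up the vector for a single token.
print(wv["sentence"])

# Find the tokens closest to it in the embedding space.
print(wv.most_similar("sentence", topn=3))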
Conclusion
Training custom embeddings isn't right for everyone, but if you're building an assistant in a language that does not have pre-trained word embeddings or for a niche use case, it could provide a solution. This feature is experimental, but we hope it gives our users more freedom in their Rasa NLU pipelines.
Our vision for rasa-nlu-examples is that we can offer more experimental features to our community. It's a contrib-like library where the community is also welcome to contribute. For example, the BytePairFeaturizer (which features embeddings for 275 languages!) was contributed by a community member. If you find any of these features useful, we'd love to hear from you. If you have any issues, questions, or feature requests, you can always reach us on our forum or on GitHub.