
January 30th, 2023

Revisiting the Rasa Learning Center III: NLP for Developers

    Sonam Pankaj

“NLP for Developers” in the Rasa Learning Center is a quick and friendly introduction to modern NLP tools and methods, such as tokenization, word embeddings, and transfer learning. If you are a software developer, manager, or anyone who has not been through all the machine learning concepts for Rasa NLU, it’s the best place to get an overview.

Word embeddings:

A word embedding is a numeric vector that represents a word. Embeddings are used as input to machine learning models and for visualizing usage patterns in the training corpus.

Benefits of word embeddings:
Captures an approximation of meaning
Faster to build and update than hand-built models
Standard approach for NLP

Drawbacks of word embeddings:
They cannot distinguish between words that are spelled the same but have different meanings (homographs).
They can be memory intensive.
They are corpus dependent.
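
As a minimal sketch of these points, the toy example below uses hand-made 3-dimensional vectors (illustrative values, not taken from any real model). It shows how cosine similarity between vectors approximates similarity in meaning, and why a static lookup table gives a word like "bank" the same vector in every sentence.

```python
import math

# Hypothetical 3-dimensional embeddings; real models use hundreds of dimensions.
EMBEDDINGS = {
    "cat":  [0.9, 0.1, 0.0],
    "dog":  [0.8, 0.2, 0.1],
    "bank": [0.1, 0.9, 0.3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(sentence):
    """Look up a static vector for each known word in the sentence."""
    return [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]

sim_cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
sim_cat_bank = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["bank"])

# "bank" gets the same vector whether it refers to a river or to money.
river = embed("the river bank")
money = embed("the bank account")
```

Here `sim_cat_dog` comes out higher than `sim_cat_bank`, while `river` and `money` contain identical vectors for "bank".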

for more details, click here


Transformers:

There are several neural network architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The transformer is one such architecture. It aims to solve sequence-to-sequence tasks such as machine translation. It relies on attention mechanisms: self-attention computes, for each word in a sequence, how much weight to give to every other word. The original transformer has two parts, an encoder and a decoder. The encoder takes the input sequence and transforms it into embeddings, and the decoder takes these embeddings and generates the output sequence.
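
To make the attention idea concrete, here is a simplified sketch of scaled dot-product self-attention in plain Python. It assumes the queries, keys, and values are all the raw input vectors; a real transformer applies learned projection matrices first and uses several attention heads, which are omitted here. The token vectors are made-up values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the current token against every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output is a weighted average of all token vectors.
        output = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["the", "cat", "sat"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical token vectors
outputs, weights = self_attention(vectors)
```

Each row of `weights` says how much one token attends to every token in the sequence. Because each row is computed independently of the others, the rows can be processed in parallel, which is where the training speed-up comes from.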

Benefits of transformers:
Faster training time due to parallelization
Attention weighs the importance of words to one another regardless of their order or proximity in a sentence.

Drawbacks of transformers:
Huge models that require a lot of memory and compute to train

for more details, click here


Tokenization:

Tokenization is the process of breaking language input into smaller, combinable pieces.

Types of tokenization:
White space tokenization: Breaks text apart based on white space.
Dictionary-based tokenization: For languages without white space between tokens. It requires a hand-built dictionary.
Subword tokenization: The tokens are learned by finding repeated character sequences in the training dataset, e.g., the ConveRT tokenizer.
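
The three approaches can be sketched roughly as follows. The dictionary-based function uses greedy longest-match against a tiny hand-built vocabulary, and the subword function shows only one counting step in the style of byte-pair encoding. Both are illustrative assumptions and not how the ConveRT tokenizer or any particular library is implemented.

```python
from collections import Counter

def whitespace_tokenize(text):
    """Split on runs of white space."""
    return text.split()

def dictionary_tokenize(text, vocabulary):
    """Greedy longest-match segmentation for text without spaces."""
    max_len = max(len(word) for word in vocabulary)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocabulary or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

def most_frequent_pair(words):
    """Find the most common adjacent character pair, a candidate subword."""
    pairs = Counter()
    for word in words:
        for left, right in zip(word, word[1:]):
            pairs[left + right] += 1
    return pairs.most_common(1)[0][0]

VOCAB = {"book", "a", "flight", "to", "berlin"}  # hypothetical dictionary

ws_tokens = whitespace_tokenize("book a flight to berlin")
dict_tokens = dictionary_tokenize("bookaflighttoberlin", VOCAB)
subword = most_frequent_pair(["hug", "hugs", "bug"])
```

Both `ws_tokens` and `dict_tokens` end up as `["book", "a", "flight", "to", "berlin"]`, and `subword` is `"ug"`, the pair a subword learner would merge into a token first.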

Benefits of tokenization:
A standard approach in NLP
Smaller pieces are more likely to be repeated and, thus, easier to learn.
Allows you to use methods built around words.

Drawbacks of tokenization:
Depending on the language, you may need to create or improve a tokenizer.
New tokenization methods tend to be labor- and data-intensive, which is bad for low-resource languages.

for more details, click here


BERT:

A language model is a probability distribution over sequences of words. BERT is a masked language model that produces contextual word embeddings, so words that are spelled the same now get different representations when they mean different things.
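
To illustrate what "a probability distribution over sequences of words" means, here is a minimal bigram language model built from a three-sentence toy corpus (made up for this example). Note that this is a much simpler, left-to-right model: a masked language model instead predicts a hidden word from the context on both sides.

```python
from collections import Counter, defaultdict

corpus = [
    "i want to book a flight",
    "i want to book a hotel",
    "i want a refund",
]

# Count how often each word follows the previous one; "<s>" marks sentence start.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for prev, word in zip(words, words[1:]):
        bigram_counts[prev][word] += 1

def next_word_probs(prev):
    """Probability of each word that was seen following `prev`."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def sentence_probability(sentence):
    """Multiply the conditional probabilities of each word given the previous one."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= next_word_probs(prev).get(word, 0.0)
    return prob

p_seen = sentence_probability("i want a refund")
p_scrambled = sentence_probability("refund want i")
```

The model assigns a non-zero probability to the sentence it has seen and zero to the scrambled one, because no training sentence starts with "refund".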

BERT is a large language model with 340 million parameters in BERT-large and 110 million in BERT-base. The architecture stacks multiple encoders together, takes the embeddings from the last layer, and uses them in a task-specific model. The embeddings can be used as-is as input to a classifier, or the BERT weights can also be updated during fine-tuning.

Extensions of architecture that are smaller and more efficient:

How it is used:
Take the final layer and use it as input for a new model
Fine-tuning a general-purpose pre-trained model for specific tasks

Benefits of BERT:
Transferable model: can be used as input to smaller task-specific models
With successful fine-tuning, it can achieve very good accuracy
Pretrained BERT models are available in 100+ languages

Drawbacks of BERT:
Slow to train
It needs to be fine-tuned for different downstream tasks.

for more details, click here

Intent and Confidence:

An intent is a representation of a task that a user wants to do in a conversation. Using intents means you can treat modeling what a user does in a conversation as a multi-class classification problem.
Confidence is a measure that a model returns of how similar the current input is to a specific class in training data.
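
As a rough sketch of how a classifier can produce a confidence value, the example below applies a softmax to raw per-intent scores and reports the highest probability. The intent names and scores are hypothetical, and softmax is an assumption made for illustration; different pipelines compute confidence in different ways.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_intent(scores):
    """Return (intent, confidence) from a dict of raw per-intent scores."""
    intents = list(scores)
    probs = softmax([scores[name] for name in intents])
    best = max(range(len(intents)), key=lambda i: probs[i])
    return intents[best], probs[best]

# Hypothetical raw scores from a classifier for one user message.
raw_scores = {"greet": 0.2, "book_flight": 2.5, "goodbye": -1.0}
intent, confidence = predict_intent(raw_scores)
```

Here `intent` is `"book_flight"` with a confidence of roughly 0.88. The value depends entirely on how the scores are scaled, which is one reason confidence values from different pipelines are not directly comparable.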

If you see a lot of models with a high confidence score, your model could be more balanced.
Higher confidence doesn’t mean a more accurate model. Trying to maximize model confidence tends to result in overfitting to the training data.
Avoid directly comparing confidence scores from different pipelines.

Benefits of intents and confidence
Many established methods are available
Simple to evaluate the model
Get a confidence score whose usability is discussed later.
Makes modeling conversations more tractable
Reduces uncertainty in model behavior

for more details, click here

Stop Word Removal:

Stop words are words that occur often in the corpus but don’t contribute much to the meaning. A stop word list is a list of prevalent but uninformative words that you want to ignore.
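
A minimal sketch of stop word removal, assuming a small hand-picked stop word list (real lists are longer and language specific):

```python
# Illustrative stop word list, chosen by hand for this example.
STOP_WORDS = {"a", "an", "the", "to", "i", "of", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "I want to book a flight to Berlin".split()
filtered = remove_stop_words(tokens)
```

The eight input tokens shrink to `["want", "book", "flight", "Berlin"]`, which is the smaller text representation that the model then has to handle.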

You might want to remove stop words anyway if model speed or size is an important consideration for your application.

Benefits of removing stopwords:
Smaller text representations = faster and smaller models.
For some approaches, removing stop words doesn’t degrade model performance and may even improve it.

for more details, click here


Featurizer:

Generally, a featurizer is a part of an NLP pipeline that converts input data into features.
It takes in raw input and gives you features that can be fed into the model, e.g., labels or tags and embeddings.

As a part of the NLP pipeline, it’s a term for a number of techniques you use to convert input tokens into machine-readable representations.
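
One of the simplest such techniques is a bag-of-words count. The class below is a hypothetical sketch written for this post, not the API of any library: it learns a vocabulary from tokenized training examples and then turns any token list into a vector of counts.

```python
class BagOfWordsFeaturizer:
    """Maps a list of tokens to a vector of per-word counts."""

    def __init__(self):
        self.vocabulary = {}

    def fit(self, token_lists):
        # Assign each new token the next free index in the feature vector.
        for tokens in token_lists:
            for token in tokens:
                self.vocabulary.setdefault(token, len(self.vocabulary))
        return self

    def transform(self, tokens):
        features = [0] * len(self.vocabulary)
        for token in tokens:
            index = self.vocabulary.get(token)
            if index is not None:  # tokens unseen during fit are ignored
                features[index] += 1
        return features

training = [["book", "a", "flight"], ["cancel", "a", "flight"]]
featurizer = BagOfWordsFeaturizer().fit(training)
features = featurizer.transform(["book", "a", "a", "train"])
```

With the vocabulary order `book, a, flight, cancel`, the result is `[1, 2, 0, 0]`. The unseen word "train" is dropped, which hints at why features learned on one domain may not carry over to another.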

Benefits of featurizers:
They allow you to take text input and convert it to a numeric representation.

Drawbacks of featurizers:
Pretrained features can be domain dependent
You may need to train your own featurizer

for more details, click here

GPT-3:

GPT-3 is also a large language model. Given an input text, it probabilistically predicts what tokens from a known vocabulary will come next. It only has the decoder part of the transformer.

The authors propose a large variety of tasks in addition to existing NLP benchmarks.
You can provide seed text that demonstrates the task at hand, and the model produces more text based on that.
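
The seed text is often a handful of worked examples followed by an unfinished one. The helper below is a hypothetical sketch of how such a prompt might be assembled. The "Text:" and "Intent:" labels are an arbitrary format chosen for illustration, and no model or API is called.

```python
def build_few_shot_prompt(examples, query):
    """Format labeled examples plus an unfinished one for the model to complete."""
    blocks = []
    for text, label in examples:
        blocks.append(f"Text: {text}\nIntent: {label}\n")
    # The final block is left open so the model continues with the label.
    blocks.append(f"Text: {query}\nIntent:")
    return "\n".join(blocks)

examples = [
    ("hello there", "greet"),
    ("i need a flight to berlin", "book_flight"),
]
prompt = build_few_shot_prompt(examples, "see you later")
```

The resulting string would be sent to the model as input, and whatever tokens it predicts next are read as the answer.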

Benefits of GPT-3:
Produces fluent-sounding text
A single, very flexible pre-trained model that can be applied to a wide variety of tasks
Conceptually simple to use.

Drawbacks of GPT-3:
93 percent of the training data is in English
It is costly to train
Paid API access
Unpredictable outputs.

for more details, click here

Transfer Learning:

Transfer learning is a way to reuse a model trained on a related task instead of training a new one from scratch. By taking a pre-trained model and applying it to a different task, you can spend less time on training and need less data.
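
A small sketch of the idea, under loud assumptions: the "pretrained" word vectors below are made-up stand-ins that stay frozen, and only a tiny logistic-regression head is trained on two labeled examples. Because similar words already have similar vectors, the head generalizes to words it never saw during training.

```python
import math

# Stand-in for pretrained word vectors (hypothetical values). They stay frozen.
PRETRAINED = {
    "hello": [1.0, 0.0], "hi": [0.9, 0.1],
    "bye": [0.0, 1.0], "goodbye": [0.1, 0.9],
}

def featurize(text):
    """Average the frozen pretrained vectors of the known words in the text."""
    vectors = [PRETRAINED[w] for w in text.split() if w in PRETRAINED]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def predict(text, weights, bias):
    """Probability that the text is a greeting, from a logistic-regression head."""
    x = featurize(text)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 / (1 + math.exp(-z))

def train_head(examples, epochs=200, lr=0.5):
    """Train only the small head with stochastic gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            error = predict(text, weights, bias) - label
            x = featurize(text)
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Only two labeled examples: 1 = greet, 0 = goodbye.
weights, bias = train_head([("hello", 1), ("bye", 0)])
p_hi = predict("hi", weights, bias)
p_goodbye = predict("goodbye", weights, bias)
```

Neither "hi" nor "goodbye" appears in the labeled data, yet `p_hi` lands above 0.5 and `p_goodbye` below it, since the pretrained vectors carry that knowledge over.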

Transfer learning = less training time.

Common examples of transfer learning:
Word vectors
Language models
Cross-lingual transfer learning

Benefits of transfer learning:
Saves time and money
Doesn’t need a huge amount of data

Drawbacks of transfer learning:
The more similar the target task is to the original task, the better transfer learning performs, so it helps less on dissimilar tasks.
A general-purpose model requires a lot of data to train and will probably be very big.
The final model can pick up and amplify biases in the pre-trained model.

for more details, click here