Version: 3.x

rasa.nlu.tokenizers.tokenizer

Token Objects

class Token()

Used by Tokenizers which split a single message into multiple Tokens.

__init__

def __init__(text: Text,
             start: int,
             end: Optional[int] = None,
             data: Optional[Dict[Text, Any]] = None,
             lemma: Optional[Text] = None) -> None

Create a Token.

Arguments:

  • text - The token text.
  • start - The start index of the token within the entire message.
  • end - The end index of the token within the entire message.
  • data - Additional token data.
  • lemma - An optional lemmatized version of the token text.
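
For illustration, here is how a token for the word "hello" in the message "hello world" could be constructed. This is a minimal sketch that only relies on the constructor documented above; the "pos" entry in data is an arbitrary example, not a key the library requires.

```python
from rasa.nlu.tokenizers.tokenizer import Token

# "hello world"
#  ^    ^        start/end are character offsets into the full message text
#  0    5
token = Token(
    text="hello",
    start=0,
    end=5,
    data={"pos": "INTJ"},  # arbitrary additional data attached to the token
    lemma="hello",
)

print(token.text, token.start, token.end)  # hello 0 5
```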

set

def set(prop: Text, info: Any) -> None

Sets the given property on the token to the provided value.

get

def get(prop: Text, default: Optional[Any] = None) -> Any

Returns the value of the given property, or the provided default if the property is not set.
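
As a sketch of how set and get work together (the property name "pos_tag" is an arbitrary example, not a key the library defines):

```python
from rasa.nlu.tokenizers.tokenizer import Token

token = Token(text="running", start=10, end=17, lemma="run")

# Attach an arbitrary property to the token ...
token.set("pos_tag", "VERB")

# ... and read it back later; the default is returned for missing properties.
print(token.get("pos_tag"))                 # VERB
print(token.get("unknown", default="n/a"))  # n/a
```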

fingerprint

def fingerprint() -> Text

Returns a stable hash for this Token.
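
The fingerprint can be used to compare or cache tokens. The sketch below only assumes what the docstring states, namely that tokens with identical content produce the same stable hash:

```python
from rasa.nlu.tokenizers.tokenizer import Token

a = Token(text="hello", start=0, end=5)
b = Token(text="hello", start=0, end=5)

# A stable hash: identical tokens yield identical fingerprints.
assert a.fingerprint() == b.fingerprint()
```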

Tokenizer Objects

class Tokenizer(GraphComponent, abc.ABC)

Base class for tokenizers.

__init__

def __init__(config: Dict[Text, Any]) -> None

Construct a new tokenizer.

create

@classmethod
def create(cls, config: Dict[Text, Any], model_storage: ModelStorage,
           resource: Resource,
           execution_context: ExecutionContext) -> GraphComponent

Creates a new component (see parent class for full docstring).

tokenize

@abc.abstractmethod
def tokenize(message: Message, attribute: Text) -> List[Token]

Tokenizes the text of the provided attribute of the incoming message.
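
A concrete tokenizer only has to implement this method. The following is a hypothetical, minimal sketch that splits on whitespace; it assumes message.get(attribute) returns the raw text of that attribute, and it skips the registration and default-config boilerplate a real pipeline component needs (see the bundled WhitespaceTokenizer for a complete implementation):

```python
from typing import List, Text

from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.shared.nlu.training_data.message import Message


class SimpleWhitespaceTokenizer(Tokenizer):
    """Hypothetical tokenizer splitting a message attribute on whitespace."""

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        text = message.get(attribute)

        tokens = []
        position = 0
        for word in text.split():
            # Locate the word in the original text to get character offsets.
            start = text.index(word, position)
            end = start + len(word)
            tokens.append(Token(word, start, end))
            position = end

        return tokens
```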

process_training_data

def process_training_data(training_data: TrainingData) -> TrainingData

Tokenize all training data.

process

def process(messages: List[Message]) -> List[Message]

Tokenize the incoming messages.
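
During training, process_training_data tokenizes every example in the TrainingData; at inference time, process does the same for a batch of incoming messages. Below is a rough sketch of the inference path, reusing the hypothetical SimpleWhitespaceTokenizer from above and assuming the base class accepts a plain config dict with the intent-splitting settings (the exact required keys may differ between Rasa versions):

```python
from rasa.shared.nlu.training_data.message import Message

tokenizer = SimpleWhitespaceTokenizer(
    {"intent_tokenization_flag": False, "intent_split_symbol": "_"}
)

message = Message(data={"text": "hello world"})

# process() tokenizes the messages in place and returns them.
processed = tokenizer.process([message])

# Reading the tokens back via tokenize() avoids relying on the exact
# message key the tokens are stored under internally.
for token in tokenizer.tokenize(processed[0], "text"):
    print(token.text, token.start, token.end)
```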