Text Transform

This documentation is generated from the sources available at dotnet/machinelearning and released under the MIT License.

Type: datatransform
Aliases: TextTransform, Text
Namespace: Microsoft.ML.Runtime.Data
Assembly: Microsoft.ML.Transforms.dll
Microsoft Documentation: Text Transform

Description

A transform that turns a collection of text documents into numerical feature vectors. The feature vectors are normalized counts of word and/or character n-grams in the tokenized text.
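To make "normalized counts of n-grams" concrete, here is a minimal Python sketch of the idea. It is not the Text Transform implementation: the whitespace tokenizer, the function names (`word_ngrams`, `char_ngrams`, `ngram_features`) and the dictionary-based sparse vector are all assumptions made for illustration. It only shows n-gram counting followed by rescaling each row to unit L2 norm.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all contiguous character n-grams of length n as strings."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(text, n=1):
    """Map one document to L2-normalized word n-gram counts.

    The result is a sparse vector stored as {ngram: weight}.
    """
    # Assumption: lowercase and split on whitespace; the real tokenizer differs.
    tokens = text.lower().split()
    counts = Counter(word_ngrams(tokens, n))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {gram: c / norm for gram, c in counts.items()}


features = ngram_features("the cat sat on the mat")
bigrams = ngram_features("the cat sat on the mat", n=2)
trigram_chars = char_ngrams("cat", 2)
```

In this sketch "the" occurs twice and every other word once, so "the" gets the largest weight, and the squared weights of each document sum to 1.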

Parameters

| Name | Short name | Default | Description |
|---|---|---|---|
| charFeatureExtractor | charExtractor | NgramExtractorTransform.NgramExtractorArguments | Ngram feature extractor to use for characters (WordBag/WordHashBag). |
| column | col | | New column definition (optional form: name:srcs). |
| dictionary | dict | | A dictionary of whitelisted terms. |
| keepDiacritics | diac | False | Whether to keep diacritical marks or remove them. |
| keepNumbers | num | True | Whether to keep numbers or remove them. |
| keepPunctuations | punc | True | Whether to keep punctuation marks or remove them. |
| language | lang | English | Dataset language or ‘AutoDetect’ to detect language per row. |
| outputTokens | tokens | False | Whether to output the transformed text tokens as an additional column. |
| stopWordsRemover | remover | | Stopwords remover. |
| textCase | case | Lower | Casing text using the rules of the invariant culture. |
| vectorNormalizer | norm | L2 | Normalize vectors (rows) individually by rescaling them to unit norm. |
| wordFeatureExtractor | wordExtractor | NgramExtractorTransform.NgramExtractorArguments | Ngram feature extractor to use for words (WordBag/WordHashBag). |
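
The text-cleaning switches listed above (keepDiacritics, keepNumbers, keepPunctuations, textCase) can be pictured with a rough Python sketch. This is an illustration under stated assumptions, not the transform's code: the function `normalize_text`, the `"upper"` casing option, and the use of Unicode categories to detect diacritics and punctuation are hypothetical choices. The default arguments mirror the defaults in the table.

```python
import unicodedata


def normalize_text(text, keep_diacritics=False, keep_numbers=True,
                   keep_punctuations=True, text_case="lower"):
    """Apply casing and character filtering to a string before tokenization."""
    # Assumption: only "lower" and "upper" are modeled; any other value keeps the case.
    if text_case == "lower":
        text = text.lower()
    elif text_case == "upper":
        text = text.upper()

    if not keep_diacritics:
        # Decompose accented characters, then drop the combining marks (category Mn).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")

    if not keep_numbers:
        text = "".join(ch for ch in text if not ch.isdigit())

    if not keep_punctuations:
        # Unicode punctuation categories all start with "P".
        text = "".join(ch for ch in text
                       if not unicodedata.category(ch).startswith("P"))

    return text


default_result = normalize_text("Café No. 5")
stripped_result = normalize_text("Café No. 5", keep_numbers=False,
                                 keep_punctuations=False)
```

With the defaults, only casing and diacritic removal are applied; the second call also strips the digit and the period.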