module sklapi.onnx_tokenizer#

Inheritance diagram of mlprodict.sklapi.onnx_tokenizer

Short summary#

module mlprodict.sklapi.onnx_tokenizer

Wrapper for tokenizers implemented in onnxruntime-extensions.

source on GitHub

Classes#

  • GPT2TokenizerTransformer – Wraps GPT2Tokenizer

  • SentencePieceTokenizerTransformer – Wraps SentencePieceTokenizer

  • TokenizerTransformerBase – Base class for SentencePieceTokenizerTransformer and GPT2TokenizerTransformer.

Properties#

  • _repr_html_ – HTML representation of estimator. This is redundant with the logic of _repr_mimebundle_. The latter should …

Static Methods#

  • _create_model

Methods#

  • __getstate__

  • __init__

  • __setstate__

  • fit – The model is not trained; this method is still needed to set the instance up and make it ready to transform.

  • transform – Applies the tokenizer to an array of strings.

Documentation#

Wrapper for tokenizers implemented in onnxruntime-extensions.

source on GitHub

class mlprodict.sklapi.onnx_tokenizer.GPT2TokenizerTransformer(vocab, merges, padding_length=-1, opset=None)#

Bases: TokenizerTransformerBase

Wraps GPT2Tokenizer into a scikit-learn transformer.

Parameters:
  • vocab – The content of the vocabulary file; its format is the same as Hugging Face's.

  • merges – The content of the merges file; its format is the same as Hugging Face's.

  • padding_length – When the input is a set of queries, the tokenized result is a ragged tensor, so it must be padded into a regular tensor; padding_length selects the padding strategy. When padding_length equals -1, each row is padded to the length of the longest row. When padding_length is greater than 0, each row is padded to padding_length.

  • opset – main opset to use

Method fit produces the following attributes:

source on GitHub
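The padding_length strategy described above can be sketched in plain Python for ragged lists of token ids. This is a hypothetical helper for illustration, not part of mlprodict or onnxruntime-extensions:

```python
def pad_token_ids(rows, padding_length=-1, pad_id=0):
    """Pad a ragged list of token-id rows into a rectangular list.

    padding_length == -1: pad every row to the length of the longest row.
    padding_length > 0:   pad (or truncate) every row to padding_length.
    """
    if padding_length == -1:
        target = max((len(r) for r in rows), default=0)
    elif padding_length > 0:
        target = padding_length
    else:
        raise ValueError("padding_length must be -1 or > 0")
    return [list(r[:target]) + [pad_id] * (target - len(r[:target]))
            for r in rows]

ragged = [[101, 7], [5], [3, 4, 9]]
pad_token_ids(ragged)                    # pad to the longest row
pad_token_ids(ragged, padding_length=4)  # pad to a fixed length of 4
```

The pad id and truncation behaviour are assumptions made for the sketch; the actual operator's padding token is defined by onnxruntime-extensions.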

__init__(vocab, merges, padding_length=-1, opset=None)#
static _create_model(vocab, merges, padding_length, domain='ai.onnx.contrib', opset=None)#
fit(X, y=None, sample_weight=None)#

The model is not trained; this method is still needed to set the instance up and make it ready to transform.

Parameters:
  • X – array of strings

  • y – unused

  • sample_weight – unused

Returns:

self

source on GitHub

transform(X)#

Applies the tokenizer to an array of strings.

Parameters:

X – array of strings.

Returns:

sparse matrix with n_features

source on GitHub
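Since transform returns a sparse matrix while the tokenizer produces ragged rows of token ids, the packing step can be illustrated with a CSR-style (data, indices, indptr) triplet. This is a stdlib-only sketch of the general technique, not mlprodict's actual implementation:

```python
def ragged_to_csr(rows, n_features):
    """Pack ragged rows of token ids into CSR-style arrays.

    data[k]    : the k-th stored token id
    indices[k] : its column (position within the row)
    indptr[i]  : start offset of row i in data/indices
    """
    data, indices, indptr = [], [], [0]
    for row in rows:
        if len(row) > n_features:
            raise ValueError("row longer than n_features")
        data.extend(row)
        indices.extend(range(len(row)))
        indptr.append(len(data))
    return data, indices, indptr

data, indices, indptr = ragged_to_csr([[101, 7], [5]], n_features=3)
# the implied matrix has shape (len(indptr) - 1, n_features)
```

The same triplet maps directly onto scipy.sparse.csr_matrix((data, indices, indptr), shape=...), which avoids storing padding explicitly.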

class mlprodict.sklapi.onnx_tokenizer.SentencePieceTokenizerTransformer(model, nbest_size=1, alpha=0.5, reverse=False, add_bos=False, add_eos=False, opset=None)#

Bases: TokenizerTransformerBase

Wraps SentencePieceTokenizer into a scikit-learn transformer.

Parameters:
  • model – The sentencepiece model serialized proto, stored as a string.

  • nbest_size – tensor(int64) A scalar for sampling. nbest_size = {0,1}: no sampling is performed (default). nbest_size > 1: samples from the nbest_size results. nbest_size < 0: assumes nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.

  • alpha – tensor(float) A scalar smoothing parameter; the inverse temperature for probability rescaling.

  • reverse – tensor(bool) Reverses the tokenized sequence.

  • add_bos – tensor(bool) Adds a beginning-of-sentence token to the result.

  • add_eos – tensor(bool) Adds an end-of-sentence token to the result. When reverse=True, beginning/end-of-sentence tokens are added after reversing.

  • opset – main opset to use

Method fit produces the following attributes:

source on GitHub

__init__(model, nbest_size=1, alpha=0.5, reverse=False, add_bos=False, add_eos=False, opset=None)#
static _create_model(model_b64, domain='ai.onnx.contrib', opset=None)#
fit(X, y=None, sample_weight=None)#

The model is not trained; this method is still needed to set the instance up and make it ready to transform.

Parameters:
  • X – array of strings

  • y – unused

  • sample_weight – unused

Returns:

self

source on GitHub

transform(X)#

Applies the tokenizer to an array of strings.

Parameters:

X – array of strings.

Returns:

sparse matrix with n_features

source on GitHub

class mlprodict.sklapi.onnx_tokenizer.TokenizerTransformerBase#

Bases: BaseEstimator, TransformerMixin

Base class for SentencePieceTokenizerTransformer and GPT2TokenizerTransformer.

source on GitHub

__getstate__()#
__init__()#
__setstate__(state)#
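The __getstate__/__setstate__ pair typically exists because a live ONNX inference session is not picklable: only the serialized model bytes are kept in the state, and the session is rebuilt on unpickling. A minimal stdlib-only sketch of that pattern (the class and attribute names are hypothetical, not mlprodict's actual implementation):

```python
import pickle


class SessionHolder:
    """Sketch of the pickling pattern used by the tokenizer transformers:
    keep the serialized model bytes, drop the live (non-picklable) session
    when pickling, and rebuild it when unpickling."""

    def __init__(self, model_bytes):
        self.model_bytes_ = model_bytes
        self.sess_ = self._create_session(model_bytes)

    @staticmethod
    def _create_session(model_bytes):
        # Stand-in for building an onnxruntime InferenceSession
        # from the serialized model.
        return {"loaded_from": model_bytes}

    def __getstate__(self):
        # Serialize everything except the live session.
        state = self.__dict__.copy()
        del state["sess_"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Rebuild the session from the stored model bytes.
        self.sess_ = self._create_session(self.model_bytes_)


holder = SessionHolder(b"model-proto")
restored = pickle.loads(pickle.dumps(holder))
```

The same idea works unchanged under scikit-learn's BaseEstimator, since pickle calls these hooks regardless of the base class.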