module sklapi.onnx_tokenizer#

Inheritance diagram of mlprodict.sklapi.onnx_tokenizer

Short summary#

module mlprodict.sklapi.onnx_tokenizer

Wrapper for tokenizers implemented in onnxruntime-extensions.

source on GitHub

Classes#

  • GPT2TokenizerTransformer – Wraps GPT2Tokenizer

  • SentencePieceTokenizerTransformer – Wraps SentencePieceTokenizer

  • TokenizerTransformerBase – Base class for SentencePieceTokenizerTransformer and GPT2TokenizerTransformer.

Properties#

  • _repr_html_ – HTML representation of estimator. This is redundant with the logic of _repr_mimebundle_. The latter should …

Static Methods#

  • _create_model

Methods#

  • __getstate__

  • __init__

  • __setstate__

  • fit – The model is not trained; this method is still needed to set the instance up and make it ready to transform.

  • transform – Applies the tokenizer to an array of strings.

Documentation#

Wrapper for tokenizers implemented in onnxruntime-extensions.

source on GitHub

class mlprodict.sklapi.onnx_tokenizer.GPT2TokenizerTransformer(vocab, merges, padding_length=-1, opset=None)#

Bases: TokenizerTransformerBase

Wraps GPT2Tokenizer into a scikit-learn transformer.

Parameters:
  • vocab – The content of the vocabulary file; its format is the same as Hugging Face's.

  • merges – The content of the merges file; its format is the same as Hugging Face's.

  • padding_length – When the input is a set of queries, the tokenized result is a ragged tensor, so it must be padded into a regular tensor; padding_length selects the padding strategy. When padding_length equals -1, each row is padded to the length of the longest row. When padding_length is greater than 0, each row is padded to padding_length.

  • opset – main opset to use

Method fit produces the following attributes:

source on GitHub
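The padding_length strategy described above can be sketched in plain Python for ragged lists of token ids. This is a hypothetical helper for illustration, not part of mlprodict or onnxruntime-extensions:

```python
def pad_token_ids(rows, padding_length=-1, pad_id=0):
    """Pad a ragged list of token-id rows into a rectangular list.

    padding_length == -1: pad every row to the length of the longest row.
    padding_length > 0:   pad (or truncate) every row to padding_length.
    """
    if padding_length == -1:
        target = max((len(r) for r in rows), default=0)
    elif padding_length > 0:
        target = padding_length
    else:
        raise ValueError("padding_length must be -1 or > 0")
    return [list(r[:target]) + [pad_id] * (target - len(r[:target]))
            for r in rows]

ragged = [[101, 7], [5], [3, 4, 9]]
pad_token_ids(ragged)                    # pad to the longest row
pad_token_ids(ragged, padding_length=4)  # pad to a fixed length of 4
```

The pad id and truncation behaviour are assumptions made for the sketch; the actual operator's padding token is defined by onnxruntime-extensions.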

__init__(vocab, merges, padding_length=-1, opset=None)#
static _create_model(vocab, merges, padding_length, domain='ai.onnx.contrib', opset=None)#
fit(X, y=None, sample_weight=None)#

The model is not trained; this method is still needed to set the instance up and make it ready to transform.

Parameters:
  • X – array of strings

  • y – unused

  • sample_weight – unused

Returns:

self

source on GitHub

transform(X)#

Applies the tokenizer to an array of strings.

Parameters:

X – array of strings.

Returns:

sparse matrix with n_features

source on GitHub
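Since transform returns a sparse matrix while the tokenizer produces ragged rows of token ids, the packing step can be illustrated with a CSR-style (data, indices, indptr) triplet. This is a stdlib-only sketch of the general technique, not mlprodict's actual implementation:

```python
def ragged_to_csr(rows, n_features):
    """Pack ragged rows of token ids into CSR-style arrays.

    data[k]    : the k-th stored token id
    indices[k] : its column (position within the row)
    indptr[i]  : start offset of row i in data/indices
    """
    data, indices, indptr = [], [], [0]
    for row in rows:
        if len(row) > n_features:
            raise ValueError("row longer than n_features")
        data.extend(row)
        indices.extend(range(len(row)))
        indptr.append(len(data))
    return data, indices, indptr

data, indices, indptr = ragged_to_csr([[101, 7], [5]], n_features=3)
# the implied matrix has shape (len(indptr) - 1, n_features)
```

The same triplet maps directly onto scipy.sparse.csr_matrix((data, indices, indptr), shape=...), which avoids storing padding explicitly.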

class mlprodict.sklapi.onnx_tokenizer.SentencePieceTokenizerTransformer(model, nbest_size=1, alpha=0.5, reverse=False, add_bos=False, add_eos=False, opset=None)#

Bases: TokenizerTransformerBase

Wraps SentencePieceTokenizer into a scikit-learn transformer.

Parameters:
  • model – The sentencepiece model serialized proto, stored as a string.

  • nbest_size – tensor(int64) A scalar for sampling. nbest_size = {0,1}: no sampling is performed (default). nbest_size > 1: samples from the nbest_size results. nbest_size < 0: assumes nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.

  • alpha – tensor(float) A scalar smoothing parameter; the inverse temperature for probability rescaling.

  • reverse – tensor(bool) Reverses the tokenized sequence.

  • add_bos – tensor(bool) Adds a beginning-of-sentence token to the result.

  • add_eos – tensor(bool) Adds an end-of-sentence token to the result. When reverse=True, beginning/end-of-sentence tokens are added after reversing.

  • opset – main opset to use

Method fit produces the following attributes:

source on GitHub

__init__(model, nbest_size=1, alpha=0.5, reverse=False, add_bos=False, add_eos=False, opset=None)#
static _create_model(model_b64, domain='ai.onnx.contrib', opset=None)#
fit(X, y=None, sample_weight=None)#

The model is not trained; this method is still needed to set the instance up and make it ready to transform.

Parameters:
  • X – array of strings

  • y – unused

  • sample_weight – unused

Returns:

self

source on GitHub

transform(X)#

Applies the tokenizer to an array of strings.

Parameters:

X – array of strings.

Returns:

sparse matrix with n_features

source on GitHub

class mlprodict.sklapi.onnx_tokenizer.TokenizerTransformerBase#

Bases: BaseEstimator, TransformerMixin

Base class for SentencePieceTokenizerTransformer and GPT2TokenizerTransformer.

source on GitHub

__getstate__()#
__init__()#
__setstate__(state)#
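The __getstate__/__setstate__ pair typically exists because a live ONNX inference session is not picklable: only the serialized model bytes are kept in the state, and the session is rebuilt on unpickling. A minimal stdlib-only sketch of that pattern (the class and attribute names are hypothetical, not mlprodict's actual implementation):

```python
import pickle


class SessionHolder:
    """Sketch of the pickling pattern used by the tokenizer transformers:
    keep the serialized model bytes, drop the live (non-picklable) session
    when pickling, and rebuild it when unpickling."""

    def __init__(self, model_bytes):
        self.model_bytes_ = model_bytes
        self.sess_ = self._create_session(model_bytes)

    @staticmethod
    def _create_session(model_bytes):
        # Stand-in for building an onnxruntime InferenceSession
        # from the serialized model.
        return {"loaded_from": model_bytes}

    def __getstate__(self):
        # Serialize everything except the live session.
        state = self.__dict__.copy()
        del state["sess_"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Rebuild the session from the stored model bytes.
        self.sess_ = self._create_session(self.model_bytes_)


holder = SessionHolder(b"model-proto")
restored = pickle.loads(pickle.dumps(holder))
```

The same idea works unchanged under scikit-learn's BaseEstimator, since pickle calls these hooks regardless of the base class.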