module sklapi.onnx_tokenizer#
Short summary#
module mlprodict.sklapi.onnx_tokenizer
Wraps tokenizers implemented in onnxruntime-extensions.
Classes#
class | truncated documentation
---|---
GPT2TokenizerTransformer | Wraps GPT2Tokenizer into a scikit-learn transformer.
SentencePieceTokenizerTransformer | Wraps SentencePieceTokenizer into a scikit-learn transformer.
TokenizerTransformerBase | Base class for SentencePieceTokenizerTransformer and GPT2TokenizerTransformer.
Properties#
property | truncated documentation
---|---
_repr_html_ | HTML representation of estimator. This is redundant with the logic of _repr_mimebundle_. The latter should … (inherited by all three classes)
Static Methods#
staticmethod | truncated documentation
---|---
_create_model |
Methods#
method | truncated documentation
---|---
__getstate__ |
__init__ |
__setstate__ |
fit | The model is not trained; this method is still needed to set the instance up and ready to transform.
transform | Applies the tokenizers on an array of strings.
Documentation#
Wraps tokenizers implemented in onnxruntime-extensions.
- class mlprodict.sklapi.onnx_tokenizer.GPT2TokenizerTransformer(vocab, merges, padding_length=-1, opset=None)#
Bases:
TokenizerTransformerBase
Wraps GPT2Tokenizer into a scikit-learn transformer.
- Parameters:
vocab – The content of the vocabulary file; its format is the same as Hugging Face's.
merges – The content of the merges file; its format is the same as Hugging Face's.
padding_length – When the input is a set of queries, the tokenized result is a ragged tensor, which must be padded into a regular tensor; padding_length selects the padding strategy. When padding_length equals -1, every row is padded to the length of the longest row. When padding_length is greater than 0, every row is padded to padding_length.
opset – main opset to use
Method fit produces the following attributes:
onnx_: onnx graph
sess_: InferenceSession used to compute the inference
- __init__(vocab, merges, padding_length=-1, opset=None)#
- static _create_model(vocab, merges, padding_length, domain='ai.onnx.contrib', opset=None)#
- _sklearn_auto_wrap_output_keys = {'transform'}#
- fit(X, y=None, sample_weight=None)#
The model is not trained; this method is still needed to set the instance up and ready to transform.
- Parameters:
X – array of strings
y – unused
sample_weight – unused
- Returns:
self
- transform(X)#
Applies the tokenizers on an array of strings.
- Parameters:
X – array of strings.
- Returns:
sparse matrix with n_features
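The padding_length behaviour described above can be sketched in plain Python. This is an illustrative stand-in only; in the real transformer the padding is performed inside the ONNX graph built by _create_model, and the function name and pad_id default below are assumptions:

```python
def pad_ragged(rows, padding_length=-1, pad_id=0):
    """Pad a ragged list of token-id rows into a regular matrix.

    padding_length == -1: pad every row to the length of the longest row.
    padding_length > 0:   pad (or truncate) every row to padding_length.
    """
    if padding_length == -1:
        target = max(len(r) for r in rows)
    else:
        target = padding_length
    # truncate each row to the target length, then right-pad with pad_id
    return [list(r[:target]) + [pad_id] * (target - len(r[:target]))
            for r in rows]
```

For example, `pad_ragged([[1, 2, 3], [4]])` pads the second row to length 3, while `pad_ragged([[1, 2, 3], [4]], padding_length=2)` truncates the first row and pads the second to length 2.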
- class mlprodict.sklapi.onnx_tokenizer.SentencePieceTokenizerTransformer(model, nbest_size=1, alpha=0.5, reverse=False, add_bos=False, add_eos=False, opset=None)#
Bases:
TokenizerTransformerBase
Wraps SentencePieceTokenizer into a scikit-learn transformer.
- Parameters:
model – The SentencePiece model (serialized proto), stored as a string.
nbest_size – tensor(int64) A scalar for sampling. nbest_size = {0,1}: no sampling is performed (default). nbest_size > 1: samples from the nbest_size best results. nbest_size < 0: assumes nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
alpha – tensor(float) A scalar for a smoothing parameter. Inverse temperature for probability rescaling.
reverse – tensor(bool) Reverses the tokenized sequence.
add_bos – tensor(bool) Add beginning of sentence token to the result.
add_eos – tensor(bool) Add end of sentence token to the result. When reverse=True, beginning/end of sentence tokens are added after reversing.
opset – main opset to use
Method fit produces the following attributes:
onnx_: onnx graph
sess_: InferenceSession used to compute the inference
- __init__(model, nbest_size=1, alpha=0.5, reverse=False, add_bos=False, add_eos=False, opset=None)#
- static _create_model(model_b64, domain='ai.onnx.contrib', opset=None)#
- _sklearn_auto_wrap_output_keys = {'transform'}#
- fit(X, y=None, sample_weight=None)#
The model is not trained; this method is still needed to set the instance up and ready to transform.
- Parameters:
X – array of strings
y – unused
sample_weight – unused
- Returns:
self
- transform(X)#
Applies the tokenizers on an array of strings.
- Parameters:
X – array of strings.
- Returns:
sparse matrix with n_features
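The nbest_size semantics documented above can be sketched as a small selection function. This is illustrative only, not SentencePiece's actual sampling algorithm; the function name is hypothetical and `hypotheses` stands for candidate tokenizations ordered from best to worst:

```python
import random

def pick_tokenization(hypotheses, nbest_size, rng=None):
    """Sketch of the documented nbest_size semantics.

    hypotheses: candidate tokenizations, ordered from best to worst.
    """
    rng = rng or random.Random(0)
    if nbest_size in (0, 1):
        # default: no sampling is performed, keep the best hypothesis
        return hypotheses[0]
    if nbest_size > 1:
        # sample among the nbest_size best results
        return rng.choice(hypotheses[:nbest_size])
    # nbest_size < 0: treat nbest_size as infinite, sample from all hypotheses
    return rng.choice(hypotheses)
```

The real implementation samples over the tokenization lattice with forward-filtering-and-backward-sampling rather than from an explicit list, but the three regimes of nbest_size behave as above.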
- class mlprodict.sklapi.onnx_tokenizer.TokenizerTransformerBase#
Bases:
BaseEstimator, TransformerMixin
Base class for SentencePieceTokenizerTransformer and GPT2TokenizerTransformer.
- __getstate__()#
- __init__()#
- __setstate__(state)#
- _sklearn_auto_wrap_output_keys = {'transform'}#
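The __getstate__/__setstate__ pair on the base class exists because an InferenceSession is not picklable: only the serialized ONNX graph survives pickling, and the session is rebuilt afterwards. A minimal sketch of that pattern, using a hypothetical stand-in class (not mlprodict's actual code; the stand-in bytes and `object()` placeholders mark where the real ModelProto and InferenceSession would go):

```python
import pickle

class TokenizerSketch:
    """Hypothetical stand-in showing the pickling pattern, not the real class."""

    def fit(self, X, y=None):
        # fit() builds the graph and the session (stand-in values here)
        self.onnx_ = b"serialized-onnx-graph"  # would be the ONNX model bytes
        self.sess_ = object()                  # would be an InferenceSession
        return self

    def __getstate__(self):
        # drop the unpicklable session, keep the serialized graph
        state = self.__dict__.copy()
        state.pop("sess_", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        if "onnx_" in state:
            # rebuild the session from the stored graph bytes,
            # i.e. roughly InferenceSession(self.onnx_) in the real class
            self.sess_ = object()
```

Round-tripping an instance through pickle then yields an object whose `onnx_` attribute is unchanged and whose `sess_` attribute has been recreated rather than copied.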