2021-05-05 Numpy API for ONNX and scikit-learn (part II)

This post follows Numpy API for ONNX and scikit-learn (part I), which demonstrated how to insert a custom function in a pipeline and still be able to convert that pipeline into ONNX. This post shows how to implement a custom transformer.

This time, what needs to be implemented is not a function but the method transform of a custom transformer. The design is the same and relies on a decorator placed before the class declaration. In the following example, a method onnx_transform implements the method transform with the API mentioned in the first part: module npy.numpy_onnx_impl. The decorator onnxsklearn_class detects that the decorated class is a transformer. It then assumes that method onnx_transform contains the ONNX implementation of method transform, and it adds an implementation for method transform to the class. That method behaves like the custom function described in part I. Once called, it detects the input type, generates the ONNX graph if it is not already available, and executes it with a runtime. That explains why the first call is much slower.
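
Before turning to the full example, the lazy behaviour described above (detect the input type, build once, reuse afterwards) can be sketched in plain Python. This is only an illustration of the idea, not how mlprodict implements it: the decorator lazy_transform, the class Shift, the attribute n_builds and the cache attribute _compiled are all hypothetical names, and "building" here simply means storing the bound method instead of generating an ONNX graph.

```python
def lazy_transform(impl_name):
    """Hypothetical stand-in for a class decorator such as onnxsklearn_class."""

    def decorate(cls):
        def transform(self, X):
            # Detect the input type: here, the type of the first element.
            key = type(X[0]).__name__
            cache = self.__dict__.setdefault("_compiled", {})
            if key not in cache:
                # First call for this type: build once and remember it.
                # A real implementation would produce an ONNX graph here,
                # which is the expensive step.
                self.n_builds = getattr(self, "n_builds", 0) + 1
                cache[key] = getattr(self, impl_name)
            # Every call: execute what was built for this input type.
            return cache[key](X)

        cls.transform = transform
        return cls

    return decorate


@lazy_transform("my_transform")
class Shift:
    def __init__(self, offset):
        self.offset = offset

    def my_transform(self, X):
        return [x - self.offset for x in X]


model = Shift(1)
print(model.transform([1, 2, 3]))    # builds for 'int', prints [0, 1, 2]
print(model.transform([4, 5]))       # reuses the 'int' entry, prints [3, 4]
print(model.transform([1.5, 2.5]))   # new input type, builds again
print(model.n_builds)                # prints 2
```

Only the first call for a given input type pays the building cost, which mirrors why the first call to the generated transform is slower than the following ones.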

<<<

import numpy
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from mlprodict.npy import onnxsklearn_class
from mlprodict.onnx_conv import to_onnx
from mlprodict.plotting.text_plot import onnx_simple_text_plot
import mlprodict.npy.numpy_onnx_impl as nxnp
import mlprodict.npy.numpy_onnx_impl_skl as nxnpskl

X, y = make_classification(200, n_classes=2, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=2, hypercube=False)

X_train, X_test, y_train, y_test = train_test_split(X, y)


@onnxsklearn_class("onnx_transform", op_version=14)  # opset=13, 14, ...
class DecorrelateTransformerOnnx(TransformerMixin, BaseEstimator):
    def __init__(self, alpha=0.):
        BaseEstimator.__init__(self)
        TransformerMixin.__init__(self)
        self.alpha = alpha

    def fit(self, X, y=None, sample_weights=None):
        self.pca_ = PCA(X.shape[1])  # pylint: disable=W0201
        self.pca_.fit(X)
        return self

    def onnx_transform(self, X):
        if X.dtype is None:
            raise AssertionError("X.dtype cannot be None.")
        mean = self.pca_.mean_.astype(X.dtype)
        cmp = self.pca_.components_.T.astype(X.dtype)
        return (X - mean) @ cmp


model = DecorrelateTransformerOnnx()
model.fit(X_train)
print(model.transform(X_test[:5]))

onx = to_onnx(model, X_test[:5], target_opset=14)  # opset=13, 14, ...
print()
print(onnx_simple_text_plot(onx))
print()
print(onx)

>>>

    [[-0.082  1.441]
     [ 0.036 -0.162]
     [-1.234 -0.016]
     [-0.041  1.191]
     [ 1.128 -1.118]]
    
    opset: domain='' version=14
    input: name='X' type=dtype('float64') shape=[None, 2]
    init: name='init' type=dtype('float64') shape=(2,) -- array([0.217, 0.066])
    init: name='init_1' type=dtype('float64') shape=(2, 2) -- array([ 0.698,  0.716,  0.716, -0.698])
    Sub(X, init) -> out_sub_0
      MatMul(out_sub_0, init_1) -> variable1
        Identity(variable1) -> variable
    output: name='variable' type=dtype('float64') shape=[None, None]
    
    ir_version: 7
    producer_name: "skl2onnx"
    producer_version: "1.13.1"
    domain: "ai.onnx"
    model_version: 0
    doc_string: ""
    graph {
      node {
        input: "X"
        input: "init"
        output: "out_sub_0"
        name: "_sub__sub"
        op_type: "Sub"
        domain: ""
      }
      node {
        input: "out_sub_0"
        input: "init_1"
        output: "variable1"
        name: "_sub__matmul"
        op_type: "MatMul"
        domain: ""
      }
      node {
        input: "variable1"
        output: "variable"
        name: "_sub_variable"
        op_type: "Identity"
        domain: ""
      }
      name: "mlprodict_ONNX(DecorrelateTransformerOnnx)"
      initializer {
        dims: 2
        data_type: 11
        name: "init"
        raw_data: "\275\337\213O)\326\313?R\350\311\222\367\351\260?"
      }
      initializer {
        dims: 2
        dims: 2
        data_type: 11
        name: "init_1"
        raw_data: "\025\\]\273fW\346?\362\341\205\027\354\350\346?\362\341\205\027\354\350\346?\025\\]\273fW\346\277"
      }
      input {
        name: "X"
        type {
          tensor_type {
            elem_type: 11
            shape {
              dim {
              }
              dim {
                dim_value: 2
              }
            }
          }
        }
      }
      output {
        name: "variable"
        type {
          tensor_type {
            elem_type: 11
            shape {
              dim {
              }
              dim {
              }
            }
          }
        }
      }
    }
    opset_import {
      domain: ""
      version: 14
    }
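
The graph above reduces to a Sub followed by a MatMul, that is, the expression (X - mean) @ cmp returned by onnx_transform. Projecting the centered data onto the principal axes is what removes the correlation between columns. The sketch below checks this with numpy alone. It assumes, for illustration only, that the mean and the axes come from an SVD of the centered data instead of a fitted PCA; the synthetic data and the variable names are not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated features: the second is a noisy copy of the first.
x0 = rng.normal(size=500)
X = np.column_stack([x0, 0.8 * x0 + 0.2 * rng.normal(size=500)])

# Mean and principal axes (rows of vt) from an SVD of the centered data.
# Assumption: these play the role of pca_.mean_ and pca_.components_.
mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
components = vt

# Same expression as in onnx_transform: center, then project.
Z = (X - mean) @ components.T

cov_before = np.cov(X, rowvar=False)
cov_after = np.cov(Z, rowvar=False)
print("covariance before:\n", cov_before.round(3))
print("covariance after:\n", cov_after.round(3))
```

The off-diagonal terms of the second matrix are zero up to rounding errors, which is the decorrelation the transformer is named after.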

The tutorial Numpy to ONNX: Create ONNX graphs with an API similar to numpy extends this example to regressors or classifiers. It also mentions a couple of frequent errors that may appear along the way.