1. How to deal with a dataframe as input?

2. Profile the execution

How to deal with a dataframe as input?

Each column of the dataframe is treated as a named input. The first step is to make sure that every column has the correct type. pandas tends to select the least generic type able to hold the content of a column, which is why the example casts every numeric column to float32. ONNX does not automatically cast the data it receives. The data must have the same type when the model is converted and when the converted model receives data to predict.
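As a minimal sketch of the type issue described above, the snippet below reads a small made-up CSV and shows the types pandas infers before and after an explicit cast to float32. The column names and values are hypothetical and are not the data used in this entry.

```python
from io import StringIO

import numpy
import pandas

# Hypothetical sample: the column names and values are illustrative only.
csv = StringIO("count,ratio,color\n1,0.5,red\n2,1.5,white\n")
df = pandas.read_csv(csv)

# pandas infers a type per column: an integer type for 'count',
# a double precision float type for 'ratio'.
inferred = {c: str(df[c].dtype) for c in ("count", "ratio")}
print(inferred)

# Cast every numeric column to the single type the model will be
# converted with, here float32.
numeric = [c for c in df.columns if c != "color"]
df = df.astype({c: numpy.float32 for c in numeric})
casted = {c: str(df[c].dtype) for c in numeric}
print(casted)
```

Casting before both conversion and prediction keeps the two steps consistent, since the runtime will not cast on its own.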


from io import StringIO
from textwrap import dedent
import numpy
import pandas
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from mlprodict.onnx_conv import to_onnx
from mlprodict.onnxrt import OnnxInference

text = dedent('''
text = text.replace(

X_train = pandas.read_csv(StringIO(text))
for c in X_train.columns:
    if c != 'color':
        X_train[c] = X_train[c].astype(numpy.float32)
numeric_features = [c for c in X_train if c != 'color']

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("color", Pipeline([
            ('one', OneHotEncoder()),
            ('select', ColumnTransformer(
                [('sel1', 'passthrough', [0])]))
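        ]), ['color']),
        ("others", "passthrough", numeric_features)
    ])),
])

# The pipeline must be fitted before it can transform or be converted.
pipe.fit(X_train)
pred = pipe.transform(X_train)
print(pred)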

model_onnx = to_onnx(pipe, X_train, target_opset=12)
oinf = OnnxInference(model_onnx)

# The dataframe is converted into a dictionary,
# each key is a column name, each value is a numpy array.
inputs = {c: X_train[c].values for c in X_train.columns}
inputs = {c: v.reshape((v.shape[0], 1)) for c, v in inputs.items()}

onxp = oinf.run(inputs)
print(onxp)


    [[1.000e+00 7.400e+00 7.000e-01 0.000e+00 1.900e+00 7.600e-02 1.100e+01
      3.400e+01 9.978e-01 3.510e+00 5.600e-01 9.400e+00 5.000e+00]
     [1.000e+00 7.800e+00 8.800e-01 0.000e+00 2.600e+00 9.800e-02 2.500e+01
      6.700e+01 9.968e-01 3.200e+00 6.800e-01 9.800e+00 5.000e+00]
     [1.000e+00 7.800e+00 7.600e-01 4.000e-02 2.300e+00 9.200e-02 1.500e+01
      5.400e+01 9.970e-01 3.260e+00 6.500e-01 9.800e+00 5.000e+00]
     [1.000e+00 1.120e+01 2.800e-01 5.600e-01 1.900e+00 7.500e-02 1.700e+01
      6.000e+01 9.980e-01 3.160e+00 5.800e-01 9.800e+00 6.000e+00]]
    {'transformed_column': array([[1.000e+00, 7.400e+00, 7.000e-01, 0.000e+00, 1.900e+00, 7.600e-02,
            1.100e+01, 3.400e+01, 9.978e-01, 3.510e+00, 5.600e-01, 9.400e+00,
            5.000e+00],
           [1.000e+00, 7.800e+00, 8.800e-01, 0.000e+00, 2.600e+00, 9.800e-02,
            2.500e+01, 6.700e+01, 9.968e-01, 3.200e+00, 6.800e-01, 9.800e+00,
            5.000e+00],
           [1.000e+00, 7.800e+00, 7.600e-01, 4.000e-02, 2.300e+00, 9.200e-02,
            1.500e+01, 5.400e+01, 9.970e-01, 3.260e+00, 6.500e-01, 9.800e+00,
            5.000e+00],
           [1.000e+00, 1.120e+01, 2.800e-01, 5.600e-01, 1.900e+00, 7.500e-02,
            1.700e+01, 6.000e+01, 9.980e-01, 3.160e+00, 5.800e-01, 9.800e+00,
            6.000e+00]], dtype=float32)}
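
The two printed results can also be compared numerically rather than by eye. The following is a minimal sketch, assuming the scikit-learn result is a float64 array and the ONNX result is a float32 array stored under the key 'transformed_column', as in the printed dictionary. The arrays are stand-ins copied from the first row of the output, not an actual call to the runtime.

```python
import numpy

# Stand-ins for the two results: first row of the output printed above.
row = [1.0, 7.4, 0.7, 0.0, 1.9, 0.076, 11.0,
       34.0, 0.9978, 3.51, 0.56, 9.4, 5.0]
pred = numpy.array([row], dtype=numpy.float64)
onxp = {"transformed_column": numpy.array([row], dtype=numpy.float32)}

got = onxp["transformed_column"]

# Shapes must match exactly.
same_shape = pred.shape == got.shape

# float32 keeps about 7 significant digits, so compare with a
# relative tolerance instead of strict equality.
numpy.testing.assert_allclose(pred, got, rtol=1e-5)
max_diff = float(numpy.abs(pred - got).max())
print(same_shape, max_diff)
```

A tolerance is used because values such as 7.4 have no exact float32 representation, so strict equality against a float64 result would fail even when both computations agree.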

(original entry : convert.py:docstring of mlprodict.onnx_conv.convert.to_onnx, line 45)

Profile the execution

py-spy can be used to profile the execution of a program. The profile is more informative if the code is compiled with debug information.

py-spy record --native -r 10 -o plot_random_forest_reg.svg -- python plot_random_forest_reg.py
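
In the command above, `record` writes the profile to a file, `--native` also collects stack traces from native extensions, `-r 10` samples ten times per second, and `-o` names the output file. The following is a minimal sketch, assuming py-spy is installed and available on the PATH, that builds the same command from Python. The helper name `py_spy_command` is hypothetical.

```python
import os
import shlex
import shutil
import subprocess


def py_spy_command(script, output, rate=10):
    """Build the py-spy command line shown above (hypothetical helper)."""
    return ["py-spy", "record", "--native", "-r", str(rate),
            "-o", output, "--", "python", script]


cmd = py_spy_command("plot_random_forest_reg.py",
                     "plot_random_forest_reg.svg")
print(shlex.join(cmd))

# Run the profiler only if py-spy and the script are both available.
if shutil.which("py-spy") is not None and os.path.exists(cmd[-1]):
    subprocess.run(cmd, check=True)
```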

(original entry : plot_opml_random_forest_reg.rst, line 51)