.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorial/plot_gbegin_dataframe.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_tutorial_plot_gbegin_dataframe.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorial_plot_gbegin_dataframe.py:


Dataframe as an input
=====================

.. index:: dataframe

A pipeline usually ingests data as a matrix. A dataframe can be
converted into a matrix if all its columns share the same type.
But a dataframe usually holds columns of several types:
float, integer, or string for categories.
ONNX also supports that case.

.. contents::
    :local:

A dataset with categories
+++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 21-60

.. code-block:: default

    from mlinsights.plotting import pipeline2dot
    import numpy
    import pprint
    from mlprodict.onnx_conv import guess_schema_from_data
    from onnxruntime import InferenceSession
    from pyquickhelper.helpgen.graphviz_helper import plot_graphviz
    from mlprodict.onnxrt import OnnxInference
    from mlprodict.onnx_conv import to_onnx as to_onnx_ext
    from skl2onnx import to_onnx
    from pandas import DataFrame
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier


    # Toy dataset: two categorical columns, two numerical columns, one label.
    data = DataFrame([
        dict(CAT1='a', CAT2='c', num1=0.5, num2=0.6, y=0),
        dict(CAT1='b', CAT2='d', num1=0.4, num2=0.8, y=1),
        dict(CAT1='a', CAT2='d', num1=0.5, num2=0.56, y=0),
        dict(CAT1='a', CAT2='d', num1=0.55, num2=0.56, y=1),
        dict(CAT1='a', CAT2='c', num1=0.35, num2=0.86, y=0),
        dict(CAT1='a', CAT2='c', num1=0.5, num2=0.68, y=1),
    ])

    cat_cols = ['CAT1', 'CAT2']
    train_data = data.drop('y', axis=1)

    # One-hot encode the categorical columns, pass the other columns through.
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'))])
    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', categorical_transformer, cat_cols)],
        remainder='passthrough')
    pipe = Pipeline([('preprocess', preprocessor),
                     ('rf', RandomForestClassifier())])
    pipe.fit(train_data, data['y'])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Pipeline(steps=[('preprocess',
                     ColumnTransformer(remainder='passthrough',
                                       transformers=[('cat',
                                                      Pipeline(steps=[('onehot',
                                                                       OneHotEncoder(handle_unknown='ignore',
                                                                                     sparse=False))]),
                                                      ['CAT1', 'CAT2'])])),
                    ('rf', RandomForestClassifier())])
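
To make the preprocessing step concrete, the following pure-Python sketch
mimics what the column transformer above computes for each row: one-hot
columns for ``CAT1`` and ``CAT2`` (categories sorted, unknown values encoded
as zeros, as with ``handle_unknown='ignore'``), followed by the numerical
columns passed through unchanged. It is only an illustration under these
assumptions, not the scikit-learn implementation, and the helpers
``fit_categories`` and ``encode_row`` are hypothetical names introduced
for this example.

```python
# Illustrative sketch only: mimics OneHotEncoder(handle_unknown='ignore')
# followed by remainder='passthrough'. Helper names are hypothetical.

rows = [
    dict(CAT1='a', CAT2='c', num1=0.5, num2=0.6),
    dict(CAT1='b', CAT2='d', num1=0.4, num2=0.8),
    dict(CAT1='a', CAT2='d', num1=0.5, num2=0.56),
]
cat_cols = ['CAT1', 'CAT2']
num_cols = ['num1', 'num2']


def fit_categories(rows, cat_cols):
    """Collects the sorted distinct values of every categorical column."""
    return {c: sorted({row[c] for row in rows}) for c in cat_cols}


def encode_row(row, categories, cat_cols, num_cols):
    """Returns one-hot features for categorical columns, then numerical ones."""
    features = []
    for c in cat_cols:
        # A category unseen during fit_categories yields only zeros.
        features.extend(1.0 if row[c] == v else 0.0 for v in categories[c])
    # Numerical columns are appended unchanged, in their original order.
    features.extend(float(row[c]) for c in num_cols)
    return features


categories = fit_categories(rows, cat_cols)
matrix = [encode_row(row, categories, cat_cols, num_cols) for row in rows]
for line in matrix:
    print(line)
```

Each printed row is the kind of all-numerical feature vector the random
forest receives once the categorical columns have been encoded.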


.. GENERATED FROM PYTHON SOURCE LINES 61-62

Display the pipeline.

.. GENERATED FROM PYTHON SOURCE LINES 62-68

.. code-block:: default


    dot = pipeline2dot(pipe, train_data)
    ax = plot_graphviz(dot)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)



.. image-sg:: /auto_tutorial/images/sphx_glr_plot_gbegin_dataframe_001.png
   :alt: plot gbegin dataframe
   :srcset: /auto_tutorial/images/sphx_glr_plot_gbegin_dataframe_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 69-73

Conversion to ONNX
++++++++++++++++++

Function *to_onnx* does not handle dataframes.

.. GENERATED FROM PYTHON SOURCE LINES 73-80

.. code-block:: default



    try:
        onx = to_onnx(pipe, train_data[:1])
    except NotImplementedError as e:
        print(e)



.. GENERATED FROM PYTHON SOURCE LINES 81-82

But it is possible to use an extended version.

.. GENERATED FROM PYTHON SOURCE LINES 82-88

.. code-block:: default


    onx = to_onnx_ext(
        pipe, train_data[:1],
        options={RandomForestClassifier: {'zipmap': False}})



.. GENERATED FROM PYTHON SOURCE LINES 89-91

Graph
+++++

.. GENERATED FROM PYTHON SOURCE LINES 91-99

.. code-block:: default



    oinf = OnnxInference(onx)
    ax = plot_graphviz(oinf.to_dot())
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)



.. image-sg:: /auto_tutorial/images/sphx_glr_plot_gbegin_dataframe_002.png
   :alt: plot gbegin dataframe
   :srcset: /auto_tutorial/images/sphx_glr_plot_gbegin_dataframe_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 100-104

Prediction with ONNX
++++++++++++++++++++

*onnxruntime* does not support dataframes.

.. GENERATED FROM PYTHON SOURCE LINES 104-112

.. code-block:: default



    sess = InferenceSession(onx.SerializeToString())
    try:
        # Fails on purpose: run expects a dictionary, not a dataframe.
        sess.run(None, train_data)
    except Exception as e:
        print(e)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    run(): incompatible function arguments. The following argument types are supported:
        1. (self: onnxruntime.capi.onnxruntime_pybind11_state.InferenceSession, arg0: List[str], arg1: Dict[str, object], arg2: onnxruntime.capi.onnxruntime_pybind11_state.RunOptions) -> List[object]

    Invoked with: , ['label', 'probabilities'],   CAT1 CAT2  num1  num2
    0    a    c  0.50  0.60
    1    b    d  0.40  0.80
    2    a    d  0.50  0.56
    3    a    d  0.55  0.56
    4    a    c  0.35  0.86
    5    a    c  0.50  0.68, None


.. GENERATED FROM PYTHON SOURCE LINES 113-114

Let's use a shortcut.

.. GENERATED FROM PYTHON SOURCE LINES 114-120

.. code-block:: default


    oinf = OnnxInference(onx)
    got = oinf.run(train_data)
    print(pipe.predict(train_data))
    print(got['label'])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [0 1 0 1 0 1]
    [0 1 0 1 0 1]


.. GENERATED FROM PYTHON SOURCE LINES 121-122

And probabilities.

.. GENERATED FROM PYTHON SOURCE LINES 122-126

.. code-block:: default


    print(pipe.predict_proba(train_data))
    print(got['probabilities'])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [[0.82 0.18]
     [0.35 0.65]
     [0.79 0.21]
     [0.34 0.66]
     [0.75 0.25]
     [0.29 0.71]]
    [[0.82       0.18      ]
     [0.35000008 0.6499999 ]
     [0.79       0.20999998]
     [0.3400001  0.6599999 ]
     [0.75       0.24999999]
     [0.29000008 0.7099999 ]]


.. GENERATED FROM PYTHON SOURCE LINES 127-137

It looks OK. Let's dig into the details to
directly use *onnxruntime*.

Unhide conversion logic with a dataframe
++++++++++++++++++++++++++++++++++++++++

A dataframe can be seen as a set of columns with
different types. That's what ONNX should see:
a list of inputs, where each input name is the column name
and each input type is the column type.

.. GENERATED FROM PYTHON SOURCE LINES 137-143

.. code-block:: default



    init = guess_schema_from_data(train_data)

    pprint.pprint(init)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [('CAT1', StringTensorType(shape=[None, 1])),
     ('CAT2', StringTensorType(shape=[None, 1])),
     ('num1', DoubleTensorType(shape=[None, 1])),
     ('num2', DoubleTensorType(shape=[None, 1]))]


.. GENERATED FROM PYTHON SOURCE LINES 144-145

Let's use 32-bit floats instead of doubles.

.. GENERATED FROM PYTHON SOURCE LINES 145-154

.. code-block:: default



    # Cast every non-categorical column to float32.
    for c in train_data.columns:
        if c not in cat_cols:
            train_data[c] = train_data[c].astype(numpy.float32)

    init = guess_schema_from_data(train_data)
    pprint.pprint(init)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [('CAT1', StringTensorType(shape=[None, 1])),
     ('CAT2', StringTensorType(shape=[None, 1])),
     ('num1', FloatTensorType(shape=[None, 1])),
     ('num2', FloatTensorType(shape=[None, 1]))]


.. GENERATED FROM PYTHON SOURCE LINES 155-156

Let's convert with *skl2onnx* only.

.. GENERATED FROM PYTHON SOURCE LINES 156-161

.. code-block:: default


    onx2 = to_onnx(
        pipe, initial_types=init,
        options={RandomForestClassifier: {'zipmap': False}})


.. GENERATED FROM PYTHON SOURCE LINES 162-166

Let's run it with *onnxruntime*.
We need to convert the dataframe into a dictionary
where column names become keys and column values become values,
each reshaped into a single-column array to match
the declared shape ``[None, 1]``.

.. GENERATED FROM PYTHON SOURCE LINES 166-171

.. code-block:: default


    inputs = {c: train_data[c].values.reshape((-1, 1))
              for c in train_data.columns}
    pprint.pprint(inputs)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'CAT1': array([['a'],
           ['b'],
           ['a'],
           ['a'],
           ['a'],
           ['a']], dtype=object),
     'CAT2': array([['c'],
           ['d'],
           ['d'],
           ['d'],
           ['c'],
           ['c']], dtype=object),
     'num1': array([[0.5 ],
           [0.4 ],
           [0.5 ],
           [0.55],
           [0.35],
           [0.5 ]], dtype=float32),
     'num2': array([[0.6 ],
           [0.8 ],
           [0.56],
           [0.56],
           [0.86],
           [0.68]], dtype=float32)}


.. GENERATED FROM PYTHON SOURCE LINES 172-173

Inference.

.. GENERATED FROM PYTHON SOURCE LINES 173-181

.. code-block:: default



    sess2 = InferenceSession(onx2.SerializeToString())

    got2 = sess2.run(None, inputs)

    print(pipe.predict(train_data))
    print(got2[0])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [0 1 0 1 0 1]
    [0 1 0 1 0 1]


.. GENERATED FROM PYTHON SOURCE LINES 182-183

And probabilities.

.. GENERATED FROM PYTHON SOURCE LINES 183-186

.. code-block:: default


    print(pipe.predict_proba(train_data))
    print(got2[1])


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [[0.82 0.18]
     [0.35 0.65]
     [0.79 0.21]
     [0.34 0.66]
     [0.75 0.25]
     [0.29 0.71]]
    [[0.82       0.18      ]
     [0.35000032 0.6499997 ]
     [0.78999996 0.21000002]
     [0.34000033 0.65999967]
     [0.75       0.25000003]
     [0.29000038 0.7099996 ]]


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  5.013 seconds)


.. _sphx_glr_download_auto_tutorial_plot_gbegin_dataframe.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_gbegin_dataframe.py <plot_gbegin_dataframe.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_gbegin_dataframe.ipynb <plot_gbegin_dataframe.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    Gallery generated by Sphinx-Gallery