
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_gbegin_dataframe.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_gbegin_dataframe.py:


Dataframe as an input
=====================

.. index:: dataframe

A pipeline usually ingests data as a matrix. A dataframe can be
converted into a matrix if all its columns share the same type,
but a dataframe usually holds several types: float, integer,
or string for categories. ONNX also supports that case.

.. contents::
    :local:

A dataset with categories
+++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 19-58

.. code-block:: default

    from mlinsights.plotting import pipeline2dot
    import numpy
    import pprint
    from mlprodict.onnx_conv import guess_schema_from_data
    from onnxruntime import InferenceSession
    from pyquickhelper.helpgen.graphviz_helper import plot_graphviz
    from mlprodict.onnxrt import OnnxInference
    from mlprodict.onnx_conv import to_onnx as to_onnx_ext
    from skl2onnx import to_onnx
    from pandas import DataFrame
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier


    # Two categorical columns, two numerical columns and a binary target.
    data = DataFrame([
        dict(CAT1='a', CAT2='c', num1=0.5, num2=0.6, y=0),
        dict(CAT1='b', CAT2='d', num1=0.4, num2=0.8, y=1),
        dict(CAT1='a', CAT2='d', num1=0.5, num2=0.56, y=0),
        dict(CAT1='a', CAT2='d', num1=0.55, num2=0.56, y=1),
        dict(CAT1='a', CAT2='c', num1=0.35, num2=0.86, y=0),
        dict(CAT1='a', CAT2='c', num1=0.5, num2=0.68, y=1),
    ])

    cat_cols = ['CAT1', 'CAT2']
    train_data = data.drop('y', axis=1)


    # Categorical columns are one-hot encoded,
    # the other columns are passed through unchanged.
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'))])
    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', categorical_transformer, cat_cols)],
        remainder='passthrough')
    pipe = Pipeline([('preprocess', preprocessor),
                     ('rf', RandomForestClassifier())])
    pipe.fit(train_data, data['y'])


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    Pipeline(steps=[('preprocess',
                     ColumnTransformer(remainder='passthrough',
                                       transformers=[('cat',
                                                      Pipeline(steps=[('onehot',
                                                                       OneHotEncoder(handle_unknown='ignore',
                                                                                     sparse=False))]),
                                                      ['CAT1', 'CAT2'])])),
                    ('rf', RandomForestClassifier())])



.. GENERATED FROM PYTHON SOURCE LINES 59-60

Display the pipeline.

.. GENERATED FROM PYTHON SOURCE LINES 60-66

.. code-block:: default


    dot = pipeline2dot(pipe, train_data)
    ax = plot_graphviz(dot)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)


.. image:: /auto_examples/images/sphx_glr_plot_gbegin_dataframe_001.png
    :alt: plot gbegin dataframe
    :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 67-71

Conversion to ONNX
++++++++++++++++++

Function *to_onnx* does not handle dataframes.

.. GENERATED FROM PYTHON SOURCE LINES 71-78

.. code-block:: default


    try:
        onx = to_onnx(pipe, train_data[:1])
    except NotImplementedError as e:
        print(e)


.. GENERATED FROM PYTHON SOURCE LINES 79-80

But it is possible to use an extended version of it.

.. GENERATED FROM PYTHON SOURCE LINES 80-86

.. code-block:: default


    onx = to_onnx_ext(
        pipe, train_data[:1],
        options={RandomForestClassifier: {'zipmap': False}})


.. GENERATED FROM PYTHON SOURCE LINES 87-89

Graph
+++++

.. GENERATED FROM PYTHON SOURCE LINES 89-97

.. code-block:: default


    oinf = OnnxInference(onx)
    ax = plot_graphviz(oinf.to_dot())
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)


.. image:: /auto_examples/images/sphx_glr_plot_gbegin_dataframe_002.png
    :alt: plot gbegin dataframe
    :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 98-102

Prediction with ONNX
++++++++++++++++++++

*onnxruntime* does not support dataframes.

.. GENERATED FROM PYTHON SOURCE LINES 102-110

.. code-block:: default


    sess = InferenceSession(onx.SerializeToString())
    try:
        sess.run(None, train_data)
    except Exception as e:
        print(e)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    run(): incompatible function arguments. The following argument types are supported:
        1. (self: onnxruntime.capi.onnxruntime_pybind11_state.InferenceSession, arg0: List[str], arg1: Dict[str, object], arg2: onnxruntime.capi.onnxruntime_pybind11_state.RunOptions) -> List[object]

    Invoked with: , ['label', 'probabilities'],   CAT1 CAT2  num1  num2
    0    a    c  0.50  0.60
    1    b    d  0.40  0.80
    2    a    d  0.50  0.56
    3    a    d  0.55  0.56
    4    a    c  0.35  0.86
    5    a    c  0.50  0.68, None


.. GENERATED FROM PYTHON SOURCE LINES 111-112

Let's use a shortcut: *OnnxInference* accepts the dataframe directly.

.. GENERATED FROM PYTHON SOURCE LINES 112-118

.. code-block:: default


    oinf = OnnxInference(onx)
    got = oinf.run(train_data)
    print(pipe.predict(train_data))
    print(got['label'])


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [0 1 0 1 0 1]
    [0 1 0 1 0 1]


.. GENERATED FROM PYTHON SOURCE LINES 119-120

And the probabilities.

.. GENERATED FROM PYTHON SOURCE LINES 120-124

.. code-block:: default


    print(pipe.predict_proba(train_data))
    print(got['probabilities'])


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [[0.84 0.16]
     [0.3  0.7 ]
     [0.78 0.22]
     [0.34 0.66]
     [0.77 0.23]
     [0.3  0.7 ]]
    [[0.84000003 0.16      ]
     [0.30000013 0.69999987]
     [0.78       0.22      ]
     [0.3400001  0.6599999 ]
     [0.77       0.22999999]
     [0.30000007 0.6999999 ]]


.. GENERATED FROM PYTHON SOURCE LINES 125-135

Both outputs agree up to float precision. Let's dig into the details
to use *onnxruntime* directly.

Unhide conversion logic with a dataframe
++++++++++++++++++++++++++++++++++++++++

A dataframe can be seen as a set of columns with different types.
That's what ONNX should see: a list of inputs, one per column,
where the input name is the column name and the input type
is the column type.

.. GENERATED FROM PYTHON SOURCE LINES 135-141

.. code-block:: default


    init = guess_schema_from_data(train_data)

    pprint.pprint(init)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [('CAT1', StringTensorType(shape=[None, 1])),
     ('CAT2', StringTensorType(shape=[None, 1])),
     ('num1', DoubleTensorType(shape=[None, 1])),
     ('num2', DoubleTensorType(shape=[None, 1]))]


.. GENERATED FROM PYTHON SOURCE LINES 142-143

The numerical columns are guessed as double. Let's use float instead.

.. GENERATED FROM PYTHON SOURCE LINES 143-152

.. code-block:: default


    # Cast every non-categorical column to float32.
    for c in train_data.columns:
        if c not in cat_cols:
            train_data[c] = train_data[c].astype(numpy.float32)

    init = guess_schema_from_data(train_data)
    pprint.pprint(init)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [('CAT1', StringTensorType(shape=[None, 1])),
     ('CAT2', StringTensorType(shape=[None, 1])),
     ('num1', FloatTensorType(shape=[None, 1])),
     ('num2', FloatTensorType(shape=[None, 1]))]


.. GENERATED FROM PYTHON SOURCE LINES 153-154

Let's convert with *skl2onnx* only.

.. GENERATED FROM PYTHON SOURCE LINES 154-159

.. code-block:: default


    onx2 = to_onnx(
        pipe, initial_types=init,
        options={RandomForestClassifier: {'zipmap': False}})


.. GENERATED FROM PYTHON SOURCE LINES 160-164

Let's run it with *onnxruntime*.
We need to convert the dataframe into a dictionary
where column names become keys and each column becomes
an array with a single column as the value.

.. GENERATED FROM PYTHON SOURCE LINES 164-169

.. code-block:: default


    inputs = {c: train_data[c].values.reshape((-1, 1))
              for c in train_data.columns}
    pprint.pprint(inputs)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    {'CAT1': array([['a'],
           ['b'],
           ['a'],
           ['a'],
           ['a'],
           ['a']], dtype=object),
     'CAT2': array([['c'],
           ['d'],
           ['d'],
           ['d'],
           ['c'],
           ['c']], dtype=object),
     'num1': array([[0.5 ],
           [0.4 ],
           [0.5 ],
           [0.55],
           [0.35],
           [0.5 ]], dtype=float32),
     'num2': array([[0.6 ],
           [0.8 ],
           [0.56],
           [0.56],
           [0.86],
           [0.68]], dtype=float32)}


.. GENERATED FROM PYTHON SOURCE LINES 170-171

Inference.

.. GENERATED FROM PYTHON SOURCE LINES 171-179

.. code-block:: default


    sess2 = InferenceSession(onx2.SerializeToString())

    got2 = sess2.run(None, inputs)

    print(pipe.predict(train_data))
    print(got2[0])


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [0 1 0 1 0 1]
    [0 1 0 1 0 1]


.. GENERATED FROM PYTHON SOURCE LINES 180-181

And the probabilities.

.. GENERATED FROM PYTHON SOURCE LINES 181-184

.. code-block:: default


    print(pipe.predict_proba(train_data))
    print(got2[1])


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [[0.84 0.16]
     [0.3  0.7 ]
     [0.78 0.22]
     [0.34 0.66]
     [0.77 0.23]
     [0.3  0.7 ]]
    [[0.84000003 0.16      ]
     [0.30000037 0.69999963]
     [0.78       0.22000003]
     [0.34000033 0.65999967]
     [0.77       0.23000003]
     [0.30000037 0.69999963]]


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  5.078 seconds)


.. only:: html

 .. rst-class:: sphx-glr-signature

    Gallery generated by Sphinx-Gallery
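
The conversion this example relies on before calling *onnxruntime* (one input per column, keyed by the column name, each reshaped into a single-column array) can be sketched as a small reusable helper. This is only an illustration: ``dataframe_to_onnx_inputs`` and its ``float_dtype`` parameter are hypothetical names, not part of *onnxruntime*, *skl2onnx* or *mlprodict*. It assumes the model was converted with float inputs, as done above with ``astype(numpy.float32)``.

```python
import numpy
from pandas import DataFrame


def dataframe_to_onnx_inputs(df, float_dtype=numpy.float32):
    """Hypothetical helper (not a library function).

    Returns a dictionary {column name: array of shape (n_rows, 1)}.
    Floating-point columns are cast to ``float_dtype`` so that they
    match a model converted with float inputs; other columns are
    left untouched.
    """
    inputs = {}
    for name in df.columns:
        values = df[name].to_numpy()
        if numpy.issubdtype(values.dtype, numpy.floating):
            values = values.astype(float_dtype)
        # One column per input: shape (n_rows,) becomes (n_rows, 1).
        inputs[name] = values.reshape((-1, 1))
    return inputs


# Small dataframe with the same column layout as in the example.
df = DataFrame([
    dict(CAT1='a', CAT2='c', num1=0.5, num2=0.6),
    dict(CAT1='b', CAT2='d', num1=0.4, num2=0.8),
])

inputs = dataframe_to_onnx_inputs(df)
for name, values in inputs.items():
    print(name, values.dtype, values.shape)
```

The resulting dictionary has the form expected as the second argument of ``InferenceSession.run``, for instance ``sess2.run(None, inputs)`` with the session built in this example. Casting inside the helper leaves the original dataframe unmodified, which may be preferable when the same dataframe is also passed to the scikit-learn pipeline.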