.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_tutorial/plot_gexternal_lightgbm_reg.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_tutorial_plot_gexternal_lightgbm_reg.py:

.. _example-lightgbm-reg:

Convert a pipeline with a LightGBM regressor
============================================

.. index:: LightGBM

The discrepancies observed when using floats with the TreeEnsemble operator
(see :ref:`l-example-discrepencies-float-double`)
explain why the converter for *LGBMRegressor* may introduce significant
discrepancies even when it is used with float tensors.

The *lightgbm* library computes with doubles. A tree ensemble regressor
such as *LGBMRegressor* computes its prediction by adding the prediction
of every tree. After being converted into ONNX, this summation becomes
:math:`\left[\sum\right]_{i=1}^F float(T_i(x))`,
where *F* is the number of trees in the forest,
:math:`T_i(x)` the output of tree *i* and :math:`\left[\sum\right]`
a float addition. The discrepancy can be expressed as
:math:`D(x) = |\left[\sum\right]_{i=1}^F float(T_i(x)) - \sum_{i=1}^F T_i(x)|`.
It grows with the number of trees in the forest.

To reduce the impact, an option was added to split the node
*TreeEnsembleRegressor* into multiple ones and to sum their outputs
with doubles. If we assume the node is split into *a* nodes of
:math:`F/a` trees each, the discrepancy then becomes
:math:`D'(x) = |\sum_{k=1}^a \left[\sum\right]_{i=1}^{F/a} float(T_{(k-1)F/a + i}(x)) - \sum_{i=1}^F T_i(x)|`.

.. contents::
    :local:

Train a LGBMRegressor
+++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 41-64

.. code-block:: default

    import packaging.version as pv
    import warnings
    import timeit
    import numpy
    from pandas import DataFrame
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    from lightgbm import LGBMRegressor
    from onnxruntime import InferenceSession
    from skl2onnx import to_onnx, update_registered_converter
    from skl2onnx.common.shape_calculator import calculate_linear_regressor_output_shapes  # noqa
    from onnxmltools import __version__ as oml_version
    from onnxmltools.convert.lightgbm.operator_converters.LightGbm import convert_lightgbm  # noqa


    # Random regression problem: 1000 observations, 20 features.
    N = 1000
    X = numpy.random.randn(N, 20)
    # Note: randint(0, 1, ...) always returns 0 (the upper bound is
    # exclusive), so the second noise term vanishes. The code is kept
    # unchanged because the outputs shown below were produced with it.
    y = (numpy.random.randn(N) +
         numpy.random.randn(N) * 100 * numpy.random.randint(0, 1, 1000))

    # Many trees make the float summation error visible.
    reg = LGBMRegressor(n_estimators=1000)
    reg.fit(X, y)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    LGBMRegressor(n_estimators=1000)

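The two formulas from the introduction can be tried out without any model.
The following is a minimal sketch, not the converter's implementation:
it assumes that float32 arithmetic can be emulated with ``struct`` from the
standard library, and it replaces the tree outputs :math:`T_i(x)` with
random numbers. The helpers ``to_float32``, ``float32_sum`` and
``split_sum`` are hypothetical names introduced only for this illustration.

.. code-block:: python

    import math
    import random
    import struct


    def to_float32(value):
        """Round a Python float (a double) to the nearest float32 value."""
        return struct.unpack("f", struct.pack("f", value))[0]


    def float32_sum(values):
        """Sequential sum where every term and partial sum is a float32."""
        total = 0.0
        for v in values:
            total = to_float32(total + to_float32(v))
        return total


    def split_sum(values, n_nodes):
        """Float32 sum inside each group, double sum across the groups."""
        size = math.ceil(len(values) / n_nodes)
        groups = [values[i:i + size] for i in range(0, len(values), size)]
        return sum(float32_sum(g) for g in groups)


    rng = random.Random(0)
    # Hypothetical tree outputs T_i(x), stored as doubles.
    tree_outputs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    exact = math.fsum(tree_outputs)  # reference sum in double precision

    d_single = abs(float32_sum(tree_outputs) - exact)   # D(x), one node
    d_split = abs(split_sum(tree_outputs, 10) - exact)  # D'(x), a = 10
    print("D(x) with one node:", d_single)
    print("D'(x) with 10 nodes:", d_split)

On a single input, the split sum is not guaranteed to be closer to the
exact value, so the two numbers are best compared over many inputs.
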
.. GENERATED FROM PYTHON SOURCE LINES 65-76

Register the converter for LGBMRegressor
++++++++++++++++++++++++++++++++++++++++

The converter is implemented in :epkg:`onnxmltools`:
`onnxmltools...LightGbm.py `_.
and the shape calculator:
`onnxmltools...Regressor.py `_.

.. GENERATED FROM PYTHON SOURCE LINES 76-97

.. code-block:: default

    def skl2onnx_convert_lightgbm(scope, operator, container):
        # Forward the option 'split' to the onnxmltools converter.
        options = scope.get_options(operator.raw_operator)
        if 'split' in options:
            if pv.Version(oml_version) < pv.Version('1.9.2'):
                warnings.warn(
                    "Option split was released in version 1.9.2 but %s is "
                    "installed. It will be ignored." % oml_version)
            operator.split = options['split']
        else:
            operator.split = None
        convert_lightgbm(scope, operator, container)


    # Register the converter and declare 'split' as a supported option.
    update_registered_converter(
        LGBMRegressor, 'LightGbmLGBMRegressor',
        calculate_linear_regressor_output_shapes,
        skl2onnx_convert_lightgbm,
        options={'split': None})

.. GENERATED FROM PYTHON SOURCE LINES 98-104

Convert
+++++++

We convert the same model in two ways: with a single
*TreeEnsembleRegressor* node, or with several.
The *split* parameter is the number of trees per
*TreeEnsembleRegressor* node. With 1000 trees and ``split=100``,
each node of the split model holds 100 trees.

.. GENERATED FROM PYTHON SOURCE LINES 104-111

.. code-block:: default

    # One single TreeEnsembleRegressor node.
    model_onnx = to_onnx(reg, X[:1].astype(numpy.float32),
                         target_opset={'': 14, 'ai.onnx.ml': 2})
    # Several nodes of 100 trees each.
    model_onnx_split = to_onnx(reg, X[:1].astype(numpy.float32),
                               target_opset={'': 14, 'ai.onnx.ml': 2},
                               options={'split': 100})

.. GENERATED FROM PYTHON SOURCE LINES 112-114

Discrepancies
+++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 114-130

.. code-block:: default

    sess = InferenceSession(model_onnx.SerializeToString())
    sess_split = InferenceSession(model_onnx_split.SerializeToString())

    # Both ONNX models are compared to the predictions of lightgbm.
    X32 = X.astype(numpy.float32)
    expected = reg.predict(X32)
    got = sess.run(None, {'X': X32})[0].ravel()
    got_split = sess_split.run(None, {'X': X32})[0].ravel()

    disp = numpy.abs(got - expected).sum()
    disp_split = numpy.abs(got_split - expected).sum()

    print("sum of discrepancies 1 node", disp)
    print("sum of discrepancies split node",
          disp_split, "ratio:", disp / disp_split)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    sum of discrepancies 1 node 0.00012932610302943083
    sum of discrepancies split node 4.181837350948408e-05 ratio: 3.0925665485316083

.. GENERATED FROM PYTHON SOURCE LINES 131-133

The sum of the discrepancies was reduced by a factor of about 3 in this run.
The maximum discrepancy is reduced as well, by a factor of about 4 here.

.. GENERATED FROM PYTHON SOURCE LINES 133-140

.. code-block:: default

    disc = numpy.abs(got - expected).max()
    disc_split = numpy.abs(got_split - expected).max()

    print("max discrepancies 1 node", disc)
    print("max discrepancies split node", disc_split, "ratio:", disc / disc_split)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    max discrepancies 1 node 1.307127771354999e-06
    max discrepancies split node 3.2222179946472806e-07 ratio: 4.056608750638187

.. GENERATED FROM PYTHON SOURCE LINES 141-145

Processing time
+++++++++++++++

The split model is slightly slower, by about 10% in this run
(6.97 s against 6.36 s for 150 calls).

.. GENERATED FROM PYTHON SOURCE LINES 145-153

.. code-block:: default

    print("processing time no split",
          timeit.timeit(
              lambda: sess.run(None, {'X': X32})[0], number=150))
    print("processing time split",
          timeit.timeit(
              lambda: sess_split.run(None, {'X': X32})[0], number=150))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    processing time no split 6.361445982940495
    processing time split 6.967872331850231

.. GENERATED FROM PYTHON SOURCE LINES 154-159

Split influence
+++++++++++++++

Let's see how the maximum discrepancy changes with
the parameter *split*.

.. GENERATED FROM PYTHON SOURCE LINES 159-174

.. code-block:: default

    res = []
    for i in tqdm(list(range(20, 170, 20)) + [200, 300, 400, 500]):
        # Convert again with i trees per TreeEnsembleRegressor node.
        model_onnx_split = to_onnx(reg, X[:1].astype(numpy.float32),
                                   target_opset={'': 14, 'ai.onnx.ml': 2},
                                   options={'split': i})
        sess_split = InferenceSession(model_onnx_split.SerializeToString())
        got_split = sess_split.run(None, {'X': X32})[0].ravel()
        disc_split = numpy.abs(got_split - expected).max()
        res.append(dict(split=i, disc=disc_split))

    # The baseline is the maximum discrepancy of the unsplit model.
    df = DataFrame(res).set_index('split')
    df["baseline"] = disc
    print(df)
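
The same sweep over *split* can be imitated without lightgbm or
onnxruntime. The sketch below is only an illustration under stated
assumptions: tree outputs are replaced by random doubles, float32
arithmetic is emulated with ``struct``, and the number of nodes is
assumed to be the number of trees divided by *split*, rounded up. The
names ``to_float32`` and ``split_discrepancy`` are hypothetical. It
does not reproduce the figures of a real model.

.. code-block:: python

    import math
    import random
    import struct


    def to_float32(value):
        """Round a Python float (a double) to the nearest float32 value."""
        return struct.unpack("f", struct.pack("f", value))[0]


    def split_discrepancy(values, trees_per_node):
        """|split sum - exact sum|, float32 inside a node, double across."""
        total = 0.0
        for start in range(0, len(values), trees_per_node):
            node_sum = 0.0
            for v in values[start:start + trees_per_node]:
                node_sum = to_float32(node_sum + to_float32(v))
            total += node_sum  # double addition across nodes
        return abs(total - math.fsum(values))


    rng = random.Random(0)
    n_points, n_trees = 50, 1000
    # Hypothetical per-tree outputs for a few inputs (not a real model).
    outputs = [[rng.gauss(0.0, 1.0) for _ in range(n_trees)]
               for _ in range(n_points)]

    splits = list(range(20, 170, 20)) + [200, 300, 400, 500]
    rows = []
    for split in splits:
        worst = max(split_discrepancy(row, split) for row in outputs)
        rows.append((split, math.ceil(n_trees / split), worst))

    print("split  nodes  max discrepancy")
    for split, nodes, worst in rows:
        print(f"{split:5d}  {nodes:5d}  {worst:.3e}")

Taking the maximum over several inputs mirrors the column ``disc``
computed above, which is also a maximum over the rows of ``X32``.
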