.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "gyexamples/plot_onnx_tree_ensemble_parallel.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_gyexamples_plot_onnx_tree_ensemble_parallel.py:


.. _onnxtreeensembleparallelrst:

TreeEnsembleRegressor and parallelisation
=========================================

The operator *TreeEnsembleClassifier* describes any tree model
(decision tree, random forest, gradient boosting). The runtime
is usually implemented in C/C++ and uses parallelisation.
The notebook studies the impact of the parallelisation.

.. contents::
    :local:

Graph
+++++

The following dummy graph shows the time ratio between two runtimes
depending on the number of observations in a batch (N) and
the number of trees in the forest.

.. GENERATED FROM PYTHON SOURCE LINES 25-60

..
.. code-block:: default


    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from mlprodict.onnxrt import OnnxInference
    from onnxruntime import InferenceSession
    from skl2onnx import to_onnx
    from mlprodict.onnxrt.validate.validate_benchmark import benchmark_fct
    import sklearn
    import numpy
    from tqdm import tqdm
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.datasets import make_classification
    import matplotlib.pyplot as plt
    from mlprodict.plotting.plotting import plot_benchmark_metrics


    def plot_metric(metric, ax=None, xlabel="N", ylabel="trees", middle=1.,
                    transpose=False, shrink=1.0, title=None, figsize=None):
        # Create a new figure only when no axis is given and a size is requested.
        if figsize is not None and ax is None:
            _, ax = plt.subplots(1, 1, figsize=figsize)

        ax, cbar = plot_benchmark_metrics(
            metric, ax=ax, xlabel=xlabel, ylabel=ylabel, middle=middle,
            transpose=transpose, cbar_kw={'shrink': shrink})
        if title is not None:
            ax.set_title(title)
        return ax


    # Dummy ratios indexed by a pair of values (see xlabel and ylabel above).
    data = {(1, 1): 0.1, (10, 1): 1, (1, 10): 2,
            (10, 10): 100, (100, 1): 100, (100, 10): 1000}

    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    plot_metric(data, ax[0], shrink=0.6)



.. image-sg:: /gyexamples/images/sphx_glr_plot_onnx_tree_ensemble_parallel_001.png
   :alt: plot onnx tree ensemble parallel
   :srcset: /gyexamples/images/sphx_glr_plot_onnx_tree_ensemble_parallel_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 62-66

.. code-block:: default


    plot_metric(data, ax[1], transpose=True)


.. GENERATED FROM PYTHON SOURCE LINES 67-72

scikit-learn: T trees vs 1 tree
+++++++++++++++++++++++++++++++

Let's first compare a *GradientBoostingClassifier* from
*scikit-learn* with 1 tree against multiple trees.

.. GENERATED FROM PYTHON SOURCE LINES 72-84

..
.. code-block:: default


    ntest = 10000
    X, y = make_classification(
        n_samples=10000 + ntest, n_features=10, n_informative=5,
        n_classes=2, random_state=11)
    # The last ntest rows are kept aside for the benchmarks.
    X_train, X_test, y_train, y_test = (
        X[:-ntest], X[-ntest:], y[:-ntest], y[-ntest:])


.. GENERATED FROM PYTHON SOURCE LINES 86-99

.. code-block:: default


    ModelToTest = GradientBoostingClassifier

    N = [1, 10, 100, 1000, 10000]
    T = [1, 2, 10, 20, 50]

    # One model per number of trees.
    models = {}
    for nt in tqdm(T):
        rf = ModelToTest(n_estimators=nt, max_depth=7).fit(X_train, y_train)
        models[nt] = rf


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/5 [00:00


.. GENERATED FROM PYTHON SOURCE LINES 165-173

As expected, all ratios on the first line are close to 1 since
both models are the same. The fourth line, second column
(T=20, N=10) means an ensemble with 20 trees is slower at
computing the predictions for a batch of 10 observations
than an ensemble with 1 tree.

scikit-learn against onnxruntime
++++++++++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 173-183

.. code-block:: default


    X32 = X_test.astype(numpy.float32)
    models_onnx = {t: to_onnx(m, X32[:1]) for t, m in models.items()}

    sess_models = {t: InferenceSession(mo.SerializeToString())
                   for t, mo in models_onnx.items()}


.. GENERATED FROM PYTHON SOURCE LINES 184-185

Benchmark.

.. GENERATED FROM PYTHON SOURCE LINES 185-194

.. code-block:: default


    bench_ort = tree_benchmark(
        X_test.astype(numpy.float32),
        lambda t: models[t].predict_proba,
        lambda t: (lambda x, t_=t, se=sess_models: se[t_].run(None, {'X': x})),
        T, N)
    bench_ort


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/5 [00:00


.. GENERATED FROM PYTHON SOURCE LINES 203-211

We see *onnxruntime* is much faster for small batches. It is still
faster for big batches, but by a smaller margin.

ZipMap operator
+++++++++++++++

ZipMap just creates a new container for the same results.
The copy may impact the ratio. Let's remove it from the equation.

.. GENERATED FROM PYTHON SOURCE LINES 211-221

..
.. code-block:: default


    X32 = X_test.astype(numpy.float32)
    # Same conversion as before, without the final ZipMap operator.
    models_onnx = {t: to_onnx(m, X32[:1],
                              options={ModelToTest: {'zipmap': False}})
                   for t, m in models.items()}

    sess_models = {t: InferenceSession(mo.SerializeToString())
                   for t, mo in models_onnx.items()}


.. GENERATED FROM PYTHON SOURCE LINES 222-223

Benchmarks.

.. GENERATED FROM PYTHON SOURCE LINES 223-233

.. code-block:: default


    bench_ort = tree_benchmark(
        X_test.astype(numpy.float32),
        lambda t: models[t].predict_proba,
        lambda t: (lambda x, t_=t, se=sess_models: se[t_].run(None, {'X': x})),
        T, N)

    bench_ort


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/5 [00:00


.. code-block:: default


    # * op_tree_ensemble_common_p_.hpp
    #
    # The runtime builds a tree structure, computes the output of every
    # tree and then aggregates them. The implementation distinguishes
    # whether the batch contains only 1 observation or many.
    # It parallelises under the following conditions:
    #
    # * if the batch size $N \geqslant N_0$, it parallelises per
    #   observation, assuming every one is independent,
    # * if the batch size $N = 1$ and the number of trees
    #   $T \geqslant T_0$, it parallelises per tree.
    #
    # scikit-learn against mlprodict, no parallelisation
    # ++++++++++++++++++++++++++++++++++++++++++++++++++

    oinf_models = {t: OnnxInference(mo, runtime="python_compiled")
                   for t, mo in models_onnx.items()}



.. image-sg:: /gyexamples/images/sphx_glr_plot_onnx_tree_ensemble_parallel_004.png
   :alt: scikit-learn vs onnxruntime (no zipmap) < 1 means onnxruntime is faster
   :srcset: /gyexamples/images/sphx_glr_plot_onnx_tree_ensemble_parallel_004.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 271-272

Let's modify the thresholds which trigger the parallelisation.

.. GENERATED FROM PYTHON SOURCE LINES 272-278

.. code-block:: default


    # Thresholds far above any batch size or tree count used here,
    # so that neither kind of parallelisation is triggered.
    for _, oinf in oinf_models.items():
        oinf.sequence_[0].ops_.rt_.omp_tree_ = 10000000
        oinf.sequence_[0].ops_.rt_.omp_N_ = 10000000


.. GENERATED FROM PYTHON SOURCE LINES 279-280

Benchmarks.

..
.. GENERATED FROM PYTHON SOURCE LINES 280-289

.. code-block:: default


    bench_mlp = tree_benchmark(
        X_test.astype(numpy.float32),
        lambda t: models[t].predict,
        lambda t: (lambda x, t_=t, oi=oinf_models: oi[t_].run({'X': x})),
        T, N)
    bench_mlp


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/5 [00:00


.. GENERATED FROM PYTHON SOURCE LINES 296-297

Let's compare *onnxruntime* against *mlprodict*.

.. GENERATED FROM PYTHON SOURCE LINES 297-306

.. code-block:: default


    bench_mlp_ort = tree_benchmark(
        X_test.astype(numpy.float32),
        lambda t: (lambda x, t_=t, se=sess_models: se[t_].run(None, {'X': x})),
        lambda t: (lambda x, t_=t, oi=oinf_models: oi[t_].run({'X': x})),
        T, N)
    bench_mlp_ort


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/5 [00:00


.. GENERATED FROM PYTHON SOURCE LINES 314-317

This implementation is faster except for a high number of trees
or a high number of observations. Let's add parallelisation
for trees and observations.

.. GENERATED FROM PYTHON SOURCE LINES 317-332

.. code-block:: default


    # Low thresholds: parallelisation is triggered from 2 trees
    # or 2 observations onwards.
    for _, oinf in oinf_models.items():
        oinf.sequence_[0].ops_.rt_.omp_tree_ = 2
        oinf.sequence_[0].ops_.rt_.omp_N_ = 2

    bench_mlp_para = tree_benchmark(
        X_test.astype(numpy.float32),
        lambda t: models[t].predict,
        lambda t: (lambda x, t_=t, oi=oinf_models: oi[t_].run({'X': x})),
        T, N)

    bench_mlp_para


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/5 [00:00


.. GENERATED FROM PYTHON SOURCE LINES 340-342

Parallelisation does improve the computation time when N is big.
Let's compare with and without parallelisation.

.. GENERATED FROM PYTHON SOURCE LINES 342-352

.. code-block:: default


    # Ratio of the parallelised timings over the non-parallelised ones.
    bench_para = {}
    for k, v in bench_mlp.items():
        bench_para[k] = bench_mlp_para[k] / v

    plot_metric(bench_para, title="mlprodict vs mlprodict parallelized\n < 1 "
                "means parallelisation is faster")



..
.. image-sg:: /gyexamples/images/sphx_glr_plot_onnx_tree_ensemble_parallel_008.png
   :alt: mlprodict vs mlprodict parallelized < 1 means parallelisation is faster
   :srcset: /gyexamples/images/sphx_glr_plot_onnx_tree_ensemble_parallel_008.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 353-356

Parallelisation per tree does not seem to be efficient.
Let's confirm with a proper benchmark, as the previous plot merges
results from two benchmarks.

.. GENERATED FROM PYTHON SOURCE LINES 356-377

.. code-block:: default


    # Reference runtimes: thresholds too high to trigger parallelisation.
    for _, oinf in oinf_models.items():
        oinf.sequence_[0].ops_.rt_.omp_tree_ = 1000000
        oinf.sequence_[0].ops_.rt_.omp_N_ = 1000000

    # Second set of runtimes with low thresholds (parallelised).
    oinf_models_para = {t: OnnxInference(mo, runtime="python_compiled")
                        for t, mo in models_onnx.items()}

    for _, oinf in oinf_models_para.items():
        oinf.sequence_[0].ops_.rt_.omp_tree_ = 2
        oinf.sequence_[0].ops_.rt_.omp_N_ = 2

    bench_mlp_para = tree_benchmark(
        X_test.astype(numpy.float32),
        lambda t: (lambda x, t_=t, oi=oinf_models: oi[t_].run({'X': x})),
        lambda t: (lambda x, t_=t, oi=oinf_models_para: oi[t_].run({'X': x})),
        T, N, repeat=20, number=20)

    bench_mlp_para


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/5 [00:00


.. GENERATED FROM PYTHON SOURCE LINES 385-394

It should be run on different machines. On the current one,
parallelisation per tree (when N=1) does not seem to help.
Parallelisation for a small number of observations does not
seem to help either. So we need to find some threshold.

Parallelisation per trees
+++++++++++++++++++++++++

Let's study the parallelisation per tree. We need to train new models.

.. GENERATED FROM PYTHON SOURCE LINES 394-406

.. code-block:: default


    N2 = [1, 10]
    T2 = [1, 2, 10, 50, 100, 150, 200, 300, 400, 500]

    models2 = {}
    for nt in tqdm(T2):
        rf = ModelToTest(n_estimators=nt, max_depth=7).fit(X_train, y_train)
        models2[nt] = rf


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/10 [00:00


..
.. GENERATED FROM PYTHON SOURCE LINES 446-458

The ratio for the parallelised runtime starts to fall below 1
beyond 400 trees. For 10 observations, no parallelisation happens,
neither per tree nor per observation, so the ratios are close to 1.
The gain obviously depends on the tree depth. You can try
a different maximum depth: the number of trees above which
parallelisation becomes worthwhile changes with the tree depth.

Multi-Class DecisionTreeClassifier
++++++++++++++++++++++++++++++++++

This is the same experiment with a single tree,
but this time the number of classes varies.

.. GENERATED FROM PYTHON SOURCE LINES 458-492

.. code-block:: default


    ModelToTest = DecisionTreeClassifier

    C = [2, 5, 10, 15, 20, 30, 40, 50]
    N = [1, 10, 100, 1000, 10000]
    trees = {}
    for cl in tqdm(C):

        ntest = 10000
        X, y = make_classification(
            n_samples=10000 + ntest, n_features=12, n_informative=8,
            n_classes=cl, random_state=11)
        X_train, X_test, y_train, y_test = (
            X[:-ntest], X[-ntest:], y[:-ntest], y[-ntest:])

        dt = ModelToTest(max_depth=7).fit(X_train, y_train)

        X32 = X_test.astype(numpy.float32)
        monnx = to_onnx(dt, X32[:1])
        oinf = OnnxInference(monnx)
        # High threshold: no parallelisation per observation.
        oinf.sequence_[0].ops_.rt_.omp_N_ = 1000000
        trees[cl] = dict(model=dt, X_test=X_test, X32=X32,
                         monnx=monnx, oinf=oinf)


    bench_dt = tree_benchmark(lambda cl: trees[cl]['X32'],
                              lambda cl: trees[cl]['model'].predict_proba,
                              lambda cl: (
                                  lambda x, c=cl: trees[c]['oinf'].run({'X': x})),
                              C, N)

    bench_dt


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/8 [00:00


.. GENERATED FROM PYTHON SOURCE LINES 501-503

Multi-class LogisticRegression
++++++++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 503-536

..
.. code-block:: default


    ModelToTest = LogisticRegression

    C = [2, 5, 10, 15, 20]
    N = [1, 10, 100, 1000, 10000]

    models = {}
    for cl in tqdm(C):

        ntest = 10000
        X, y = make_classification(
            n_samples=10000 + ntest, n_features=10, n_informative=6,
            n_classes=cl, random_state=11)
        X_train, X_test, y_train, y_test = (
            X[:-ntest], X[-ntest:], y[:-ntest], y[-ntest:])

        model = ModelToTest().fit(X_train, y_train)

        X32 = X_test.astype(numpy.float32)
        monnx = to_onnx(model, X32[:1])
        oinf = OnnxInference(monnx)
        models[cl] = dict(model=model, X_test=X_test,
                          X32=X32, monnx=monnx, oinf=oinf)


    # The ONNX runtime must come from `models` (the logistic regressions),
    # not from `trees`, which holds the decision trees of the previous section.
    bench_lr = tree_benchmark(lambda cl: models[cl]['X32'],
                              lambda cl: models[cl]['model'].predict_proba,
                              lambda cl: (
                                  lambda x, c=cl: models[c]['oinf'].run({'X': x})),
                              C, N)

    bench_lr


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/5 [00:00


.. GENERATED FROM PYTHON SOURCE LINES 545-547

Decision Tree and number of features
++++++++++++++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 547-579

.. code-block:: default


    ModelToTest = DecisionTreeClassifier

    NF = [2, 10, 20, 40, 50, 70, 100, 200, 500, 1000]
    N = [1, 10, 100, 1000, 10000, 50000]
    trees_nf = {}

    for nf in tqdm(NF):
        ntest = 10000
        X, y = make_classification(
            n_samples=10000 + ntest, n_features=nf, n_informative=nf // 2 + 1,
            n_redundant=0, n_repeated=0,
            n_classes=2, random_state=11)
        X_train, X_test, y_train, y_test = (
            X[:-ntest], X[-ntest:], y[:-ntest], y[-ntest:])

        dt = ModelToTest(max_depth=7).fit(X_train, y_train)

        X32 = X_test.astype(numpy.float32)
        monnx = to_onnx(dt, X32[:1])
        oinf = OnnxInference(monnx)
        # High threshold: no parallelisation per observation.
        oinf.sequence_[0].ops_.rt_.omp_N_ = 1000000
        trees_nf[nf] = dict(model=dt, X_test=X_test,
                            X32=X32, monnx=monnx, oinf=oinf)


    bench_dt_nf = tree_benchmark(
        lambda nf: trees_nf[nf]['X32'],
        lambda nf: trees_nf[nf]['model'].predict_proba,
        lambda nf: (lambda x, c=nf: trees_nf[c]['oinf'].run({'X': x})),
        NF, N)
    bench_dt_nf


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/10 [00:00


..
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 36 minutes 33.124 seconds)


.. _sphx_glr_download_gyexamples_plot_onnx_tree_ensemble_parallel.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_onnx_tree_ensemble_parallel.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_onnx_tree_ensemble_parallel.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_