.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "gyexamples/plot_parallelism.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_gyexamples_plot_parallelism.py:

.. _l-example-parallelism:

When to parallelize?
====================

That is the question. Parallelizing a computation takes some time to set up;
it is not the right solution in every case. The following example studies the
parallelism introduced into the runtime of *TreeEnsembleRegressor* to see
when it is best to do it.

.. contents::
    :local:

.. GENERATED FROM PYTHON SOURCE LINES 18-33

.. code-block:: default

    from pprint import pprint
    import numpy
    from pandas import DataFrame
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    from sklearn import config_context
    from sklearn.datasets import make_regression
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import train_test_split
    from cpyquickhelper.numbers import measure_time
    from pyquickhelper.pycode.profiling import profile
    from mlprodict.onnx_conv import to_onnx, register_rewritten_operators
    from mlprodict.onnxrt import OnnxInference
    from mlprodict.tools.model_info import analyze_model

.. GENERATED FROM PYTHON SOURCE LINES 34-35

Available optimisations on this machine.

.. GENERATED FROM PYTHON SOURCE LINES 35-40

.. code-block:: default

    from mlprodict.testing.experimental_c_impl.experimental_c import code_optimisation
    print(code_optimisation())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    AVX-omp=8

.. GENERATED FROM PYTHON SOURCE LINES 41-43

Training and converting a model
+++++++++++++++++++++++++++++++

.. GENERATED FROM PYTHON SOURCE LINES 43-53

.. code-block:: default

    data = make_regression(50000, 20)
    X, y = data
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    hgb = HistGradientBoostingRegressor(max_iter=100, max_depth=6)
    hgb.fit(X_train, y_train)
    print(hgb)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    HistGradientBoostingRegressor(max_depth=6)

.. GENERATED FROM PYTHON SOURCE LINES 54-55

Let's get more statistics about the model itself.

.. GENERATED FROM PYTHON SOURCE LINES 55-57

.. code-block:: default

    pprint(analyze_model(hgb))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'_predictors.max|tree_.max_depth': 6,
     '_predictors.size': 100,
     '_predictors.sum|tree_.leave_count': 3100,
     '_predictors.sum|tree_.node_count': 6100,
     'train_score_.shape': 101,
     'validation_score_.shape': 101}

.. GENERATED FROM PYTHON SOURCE LINES 58-59

And let's convert it.

.. GENERATED FROM PYTHON SOURCE LINES 59-66

.. code-block:: default

    register_rewritten_operators()
    onx = to_onnx(hgb, X_train[:1].astype(numpy.float32))
    oinf = OnnxInference(onx, runtime='python_compiled')
    print(oinf)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    OnnxInference(...)

    def compiled_run(dict_inputs, yield_ops=None, context=None, attributes=None):
        if yield_ops is not None:
            raise NotImplementedError('yields_ops should be None.')
        # inputs
        X = dict_inputs['X']
        (variable, ) = n0_treeensembleregressor_3(X)
        return {
            'variable': variable,
        }

.. GENERATED FROM PYTHON SOURCE LINES 67-68

The runtime of the forest is in the following object.

.. GENERATED FROM PYTHON SOURCE LINES 68-72

.. code-block:: default

    print(oinf.sequence_[0].ops_)
    print(oinf.sequence_[0].ops_.rt_)

.. rst-class:: sphx-glr-script-out
.. code-block:: none

    TreeEnsembleRegressor_3(
        op_type=TreeEnsembleRegressor
        aggregate_function=b'SUM',
        base_values=[0.62794507],
        base_values_as_tensor=[],
        domain=ai.onnx.ml,
        inplaces={},
        ir_version=8,
        n_targets=1,
        nodes_falsenodeids=[34 17 10 ... 60 0 0],
        nodes_featureids=[12 18 13 ... 4 0 0],
        nodes_hitrates=[1. 1. 1. ... 1. 1. 1.],
        nodes_hitrates_as_tensor=[],
        nodes_missing_value_tracks_true=[1 1 1 ... 1 0 0],
        nodes_modes=[b'BRANCH_LEQ' b'BRANCH_LEQ' b'BRANCH_LEQ' ... b'BRANCH_LEQ' b'LEAF' b'LEAF'],
        nodes_nodeids=[ 0 1 2 ... 58 59 60],
        nodes_treeids=[ 0 0 0 ... 99 99 99],
        nodes_truenodeids=[ 1 2 3 ... 59 0 0],
        nodes_values=[0.21894096 0.06143481 0.02431714 ... 0.15920539 0. 0. ],
        nodes_values_as_tensor=[],
        parallel=(60, 128, 20),
        post_transform=b'NONE',
        runtime=None,
        target_ids=[0 0 0 ... 0 0 0],
        target_nodeids=[ 4 6 8 ... 57 59 60],
        target_opset=3,
        target_treeids=[ 0 0 0 ... 99 99 99],
        target_weights=[-25.663 -19.885317 -16.915827 ... 1.1101708 1.9407381 3.5393353],
        target_weights_as_tensor=[],
    )

.. GENERATED FROM PYTHON SOURCE LINES 73-75

And the threshold used to start parallelizing, based on the number
of observations.

.. GENERATED FROM PYTHON SOURCE LINES 75-79

.. code-block:: default

    print(oinf.sequence_[0].ops_.rt_.omp_N_)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    20

.. GENERATED FROM PYTHON SOURCE LINES 80-87

Profiling
+++++++++

This step relies on :epkg:`pyinstrument` to measure where the time is spent.
Both :epkg:`scikit-learn` and the :epkg:`mlprodict` runtime are called so
that the prediction times can be compared.

.. GENERATED FROM PYTHON SOURCE LINES 87-102

.. code-block:: default

    X32 = X_test.astype(numpy.float32)


    def runlocal():
        with config_context(assume_finite=True):
            for i in range(0, 100):
                oinf.run({'X': X32[:1000]})
                hgb.predict(X_test[:1000])


    print("profiling...")
    txt = profile(runlocal, pyinst_format='text')
    print(txt[1])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    profiling...

      _     ._   __/__   _ _  _  _ _/_   Recorded: 04:41:12 AM  Samples:  6069
     /_//_/// /_\ / //_// / //_'/ //     Duration: 84.956    CPU time: 584.970
    /   _/                      v4.4.0

    Program: somewhere/workspace/mlprodict/mlprodict_UT_39_std/_doc/examples/plot_parallelism.py

    84.937 profile  ../pycode/profiling.py:455
    `- 84.937 runlocal  plot_parallelism.py:91
          [42 frames hidden]  plot_parallelism, sklearn, ...

.. GENERATED FROM PYTHON SOURCE LINES 103-108

Now let's measure the performance: the average computation time per
observation for batches of 2 to 20 observations. The runtime implemented in
:epkg:`mlprodict` parallelizes the computation after a given number of
observations.

.. GENERATED FROM PYTHON SOURCE LINES 108-134

.. code-block:: default

    obs = []
    for N in tqdm(list(range(2, 21))):
        m = measure_time("oinf.run({'X': x})",
                         {'oinf': oinf, 'x': X32[:N]},
                         div_by_number=True,
                         number=20)
        m['N'] = N
        m['RT'] = 'ONNX'
        obs.append(m)

        with config_context(assume_finite=True):
            m = measure_time("hgb.predict(x)",
                             {'hgb': hgb, 'x': X32[:N]},
                             div_by_number=True,
                             number=15)
        m['N'] = N
        m['RT'] = 'SKL'
        obs.append(m)

    df = DataFrame(obs)
    num = ['min_exec', 'average', 'max_exec']
    for c in num:
        df[c] /= df['N']
    df.head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/19 [00:00
        average  deviation  min_exec  max_exec  repeat  number     ttime  context_size  N    RT
    0  0.004714   0.004948  0.000021  0.006384      10      20  0.094288           232  2  ONNX
    1  0.261380   0.265717  0.096757  0.488434      10      15  5.227596           232  2   SKL
    2  0.001372   0.005579  0.000015  0.004209      10      20  0.041167           232  3  ONNX
    3  0.166880   0.215345  0.072899  0.310811      10      15  5.006405           232  3   SKL
    4  0.000031   0.000229  0.000012  0.000203      10      20  0.001248           232  4  ONNX


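The kink in the ONNX curve can also be read from the numbers rather than
from the graph below. The following snippet is not part of the original
script; it is a minimal sketch, assuming ``df`` from the previous cell is
still in memory, which reports the batch size where the per-observation
time of the ONNX runtime drops the most.

.. code-block:: python

    # Sketch (not in the original script): locate the batch size where the
    # ONNX per-observation average time falls the most.
    # Assumes ``df`` from the previous cell.
    onnx_times = df[df.RT == 'ONNX'].set_index('N')['average'].sort_index()
    drop = onnx_times.diff()  # change of the per-observation time from one N to the next
    print("largest drop at N =", drop.idxmin())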
.. GENERATED FROM PYTHON SOURCE LINES 135-136

Graph.

.. GENERATED FROM PYTHON SOURCE LINES 136-145

.. code-block:: default

    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    df[df.RT == 'ONNX'].set_index('N')[num].plot(ax=ax[0])
    ax[0].set_title("Average ONNX prediction time per observation in a batch.")
    df[df.RT == 'SKL'].set_index('N')[num].plot(ax=ax[1])
    ax[1].set_title(
        "Average scikit-learn prediction time\nper observation in a batch.")

.. image-sg:: /gyexamples/images/sphx_glr_plot_parallelism_001.png
   :alt: Average ONNX prediction time per observation in a batch., Average scikit-learn prediction time per observation in a batch.
   :srcset: /gyexamples/images/sphx_glr_plot_parallelism_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Text(0.5, 1.0, 'Average scikit-learn prediction time\nper observation in a batch.')

.. GENERATED FROM PYTHON SOURCE LINES 146-154

Gain from parallelization
+++++++++++++++++++++++++

There is a clear gap between before and after 10 observations, the point
where the computation gets parallelized. Does this threshold depend on the
number of trees in the model? To find out, we compute for each model the
average prediction time up to 10 observations and from 10 to 20 observations.

.. GENERATED FROM PYTHON SOURCE LINES 154-166

.. code-block:: default

    def parallized_gain(df):
        df = df[df.RT == 'ONNX']
        df10 = df[df.N <= 10]
        t10 = sum(df10['average']) / df10.shape[0]
        df10p = df[df.N > 10]
        t10p = sum(df10p['average']) / df10p.shape[0]
        return t10 / t10p


    print('gain', parallized_gain(df))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    gain 2.8027269425105525

.. GENERATED FROM PYTHON SOURCE LINES 167-175

Measures based on the number of trees
+++++++++++++++++++++++++++++++++++++

We train many models with different numbers of trees to see how the
parallelization gain evolves. One model is trained for every distinct number
of trees, and the prediction time is then measured for different numbers of
observations.

.. GENERATED FROM PYTHON SOURCE LINES 175-179

.. code-block:: default

    tries_set = [2, 5, 8] + list(range(10, 50, 5)) + list(range(50, 101, 10))
    tries = [(nb, N) for N in range(2, 21, 2) for nb in tries_set]

.. GENERATED FROM PYTHON SOURCE LINES 180-181

Training.

.. GENERATED FROM PYTHON SOURCE LINES 181-191

.. code-block:: default

    models = {100: (hgb, oinf)}
    for nb in tqdm(set(_[0] for _ in tries)):
        if nb not in models:
            hgb = HistGradientBoostingRegressor(max_iter=nb, max_depth=6)
            hgb.fit(X_train, y_train)
            onx = to_onnx(hgb, X_train[:1].astype(numpy.float32))
            oinf = OnnxInference(onx, runtime='python_compiled')
            models[nb] = (hgb, oinf)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/17 [00:00
        average     deviation  min_exec  max_exec  repeat  number     ttime  context_size  N  nb    RT
    0  0.000012  4.704016e-07  0.000012  0.000013      10      50  0.000238           232  2   2  ONNX
    1  0.000012  2.464508e-07  0.000012  0.000012      10      50  0.000238           232  2   5  ONNX
    2  0.000012  2.467387e-07  0.000012  0.000012      10      50  0.000244           232  2   8  ONNX
    3  0.000012  2.223978e-07  0.000012  0.000013      10      50  0.000247           232  2  10  ONNX
    4  0.000013  3.545282e-07  0.000013  0.000013      10      50  0.000259           232  2  15  ONNX


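Before computing the gains, the raw measurements can be summarized in a more
compact form. The next snippet is not in the original example; it is a small
sketch, assuming ``df`` produced by the measurement loop above, which pivots
the per-observation average time into one row per batch size ``N`` and one
column per number of trees ``nb``.

.. code-block:: python

    # Sketch (not in the original example): one row per batch size N,
    # one column per model size nb, values are per-observation average times.
    summary = df.pivot_table(index='N', columns='nb', values='average')
    print(summary)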
.. GENERATED FROM PYTHON SOURCE LINES 214-215

Let's compute the gains.

.. GENERATED FROM PYTHON SOURCE LINES 215-225

.. code-block:: default

    gains = []
    for nb in set(df['nb']):
        gain = parallized_gain(df[df.nb == nb])
        gains.append(dict(nb=nb, gain=gain))

    dfg = DataFrame(gains)
    dfg = dfg.sort_values('nb').reset_index(drop=True).copy()
    dfg

.. rst-class:: sphx-glr-script-out

.. code-block:: none
         nb      gain
    0     2  3.340066
    1     5  3.061258
    2     8  2.817793
    3    10  2.718420
    4    15  2.464326
    5    20  2.270308
    6    25  2.077236
    7    30  1.985125
    8    35  1.869208
    9    40  1.780200
    10   45  1.722022
    11   50  1.682137
    12   60  1.578177
    13   70  2.069459
    14   80  3.819747
    15   90  4.320251
    16  100  2.016841


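The extremes of this table can also be read directly, without the graph
below. A minimal sketch, assuming ``dfg`` from the previous cell:

.. code-block:: python

    # Sketch: number of trees with the highest and the lowest gain.
    best = dfg.loc[dfg['gain'].idxmax()]
    worst = dfg.loc[dfg['gain'].idxmin()]
    print("highest gain: nb=%r, gain=%1.3f" % (best.nb, best.gain))
    print("lowest gain:  nb=%r, gain=%1.3f" % (worst.nb, worst.gain))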
.. GENERATED FROM PYTHON SOURCE LINES 226-227

Graph.

.. GENERATED FROM PYTHON SOURCE LINES 227-232

.. code-block:: default

    ax = dfg.set_index('nb').plot()
    ax.set_title(
        "Parallelization gain depending\non the number of trees\n(max_depth=6).")

.. image-sg:: /gyexamples/images/sphx_glr_plot_parallelism_002.png
   :alt: Parallelization gain depending on the number of trees (max_depth=6).
   :srcset: /gyexamples/images/sphx_glr_plot_parallelism_002.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Text(0.5, 1.0, 'Parallelization gain depending\non the number of trees\n(max_depth=6).')

.. GENERATED FROM PYTHON SOURCE LINES 233-241

That does not yet answer the question we are interested in: we would like to
know the best threshold *th*, the number of observations above which we
should parallelize. This number depends on the number of trees. A gain > 1
means parallelization should happen; here, even two observations are enough.
Let's check with lighter trees (``max_depth=2``); maybe in that case, the
conclusion is different.

.. GENERATED FROM PYTHON SOURCE LINES 241-269

.. code-block:: default

    models = {100: (hgb, oinf)}
    for nb in tqdm(set(_[0] for _ in tries)):
        if nb not in models:
            hgb = HistGradientBoostingRegressor(max_iter=nb, max_depth=2)
            hgb.fit(X_train, y_train)
            onx = to_onnx(hgb, X_train[:1].astype(numpy.float32))
            oinf = OnnxInference(onx, runtime='python_compiled')
            models[nb] = (hgb, oinf)

    obs = []
    for nb, N in tqdm(tries):
        hgb, oinf = models[nb]
        m = measure_time("oinf.run({'X': x})",
                         {'oinf': oinf, 'x': X32[:N]},
                         div_by_number=True,
                         number=50)
        m['N'] = N
        m['nb'] = nb
        m['RT'] = 'ONNX'
        obs.append(m)

    df = DataFrame(obs)
    num = ['min_exec', 'average', 'max_exec']
    for c in num:
        df[c] /= df['N']
    df.head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/17 [00:00
        average     deviation  min_exec  max_exec  repeat  number     ttime  context_size  N  nb    RT
    0  0.000012  2.962238e-07  0.000012  0.000012      10      50  0.000233           232  2   2  ONNX
    1  0.000012  3.163281e-07  0.000012  0.000012      10      50  0.000234           232  2   5  ONNX
    2  0.000012  3.331159e-07  0.000012  0.000012      10      50  0.000237           232  2   8  ONNX
    3  0.000012  2.154113e-07  0.000012  0.000012      10      50  0.000238           232  2  10  ONNX
    4  0.000012  2.503953e-07  0.000012  0.000012      10      50  0.000243           232  2  15  ONNX


.. GENERATED FROM PYTHON SOURCE LINES 270-271

Measures.

.. GENERATED FROM PYTHON SOURCE LINES 271-281

.. code-block:: default

    gains = []
    for nb in set(df['nb']):
        gain = parallized_gain(df[df.nb == nb])
        gains.append(dict(nb=nb, gain=gain))

    dfg = DataFrame(gains)
    dfg = dfg.sort_values('nb').reset_index(drop=True).copy()
    dfg

.. rst-class:: sphx-glr-script-out

.. code-block:: none
         nb       gain
    0     2   3.408700
    1     5   3.276015
    2     8   3.166396
    3    10   3.066641
    4    15   2.932275
    5    20   2.807004
    6    25   2.668917
    7    30   2.583275
    8    35   2.482075
    9    40   2.400369
    10   45   2.317305
    11   50   2.234694
    12   60   2.093900
    13   70   1.938574
    14   80  10.029273
    15   90   0.981693
    16  100   6.387053


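To compare this table with the previous one (``max_depth=6``), both gain
curves can be drawn on the same axes. This is a hypothetical sketch: the
script overwrites ``dfg``, so it assumes the first table was saved
beforehand under the name ``dfg_depth6``.

.. code-block:: python

    # Hypothetical: assumes the max_depth=6 gains were kept as ``dfg_depth6``
    # before ``dfg`` was overwritten by the max_depth=2 experiment.
    both = dfg_depth6.merge(dfg, on='nb', suffixes=('_depth6', '_depth2'))
    ax = both.set_index('nb').plot()
    ax.set_title("Parallelization gain, max_depth=6 vs max_depth=2.")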
.. GENERATED FROM PYTHON SOURCE LINES 282-283

Graph.

.. GENERATED FROM PYTHON SOURCE LINES 283-288

.. code-block:: default

    ax = dfg.set_index('nb').plot()
    ax.set_title(
        "Parallelization gain depending\non the number of trees\n(max_depth=3).")

.. image-sg:: /gyexamples/images/sphx_glr_plot_parallelism_003.png
   :alt: Parallelization gain depending on the number of trees (max_depth=3).
   :srcset: /gyexamples/images/sphx_glr_plot_parallelism_003.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Text(0.5, 1.0, 'Parallelization gain depending\non the number of trees\n(max_depth=3).')

.. GENERATED FROM PYTHON SOURCE LINES 289-300

The conclusion is roughly the same, but it also shows that the gain increases
with the number of trees, at least while the number of trees stays below the
number of cores of the processor.

Moving the threshold
++++++++++++++++++++

The last experiment consists in comparing the prediction time with or without
parallelization for different numbers of observations.

.. GENERATED FROM PYTHON SOURCE LINES 300-335

.. code-block:: default

    hgb = HistGradientBoostingRegressor(max_iter=40, max_depth=6)
    hgb.fit(X_train, y_train)
    onx = to_onnx(hgb, X_train[:1].astype(numpy.float32))
    oinf = OnnxInference(onx, runtime='python_compiled')

    obs = []
    for N in tqdm(list(range(2, 51))):
        oinf.sequence_[0].ops_.rt_.omp_N_ = 100
        m = measure_time("oinf.run({'X': x})",
                         {'oinf': oinf, 'x': X32[:N]},
                         div_by_number=True,
                         number=20)
        m['N'] = N
        m['RT'] = 'ONNX'
        m['PARALLEL'] = False
        obs.append(m)

        oinf.sequence_[0].ops_.rt_.omp_N_ = 1
        m = measure_time("oinf.run({'X': x})",
                         {'oinf': oinf, 'x': X32[:N]},
                         div_by_number=True,
                         number=50)
        m['N'] = N
        m['RT'] = 'ONNX'
        m['PARALLEL'] = True
        obs.append(m)

    df = DataFrame(obs)
    num = ['min_exec', 'average', 'max_exec']
    for c in num:
        df[c] /= df['N']
    df.head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0%| | 0/49 [00:00
        average     deviation  min_exec  max_exec  repeat  number     ttime  context_size  N    RT  PARALLEL
    0  0.000015  9.999655e-07  0.000015  0.000017      10      20  0.000306           232  2  ONNX     False
    1  0.000713  3.156275e-03  0.000018  0.005167      10      50  0.014254           232  2  ONNX      True
    2  0.000011  1.075459e-06  0.000011  0.000012      10      20  0.000341           232  3  ONNX     False
    3  0.001662  5.984540e-03  0.000013  0.004145      10      50  0.049870           232  3  ONNX      True
    4  0.000009  1.096691e-06  0.000009  0.000010      10      20  0.000377           232  4  ONNX     False


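These measurements can be turned into an estimate of the threshold *th*
discussed earlier: the smallest batch size at which the parallel run is at
least as fast as the sequential one. The following is a minimal sketch, not
part of the original script, assuming ``df`` and ``oinf`` from the previous
cells; it feeds the estimate back into the runtime attribute ``omp_N_``
shown above.

.. code-block:: python

    # Sketch (not in the original script): estimate the break-even batch size
    # and use it as the parallelization threshold of the runtime.
    piv = df.pivot_table(index='N', columns='PARALLEL', values='average')
    faster = piv[piv[True] <= piv[False]]
    if faster.shape[0] > 0:
        threshold = int(faster.index.min())
        print("parallelization pays off from N =", threshold)
        oinf.sequence_[0].ops_.rt_.omp_N_ = threshold
    else:
        print("parallelization never pays off on this machine")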
.. GENERATED FROM PYTHON SOURCE LINES 336-337

Graph.

.. GENERATED FROM PYTHON SOURCE LINES 337-342

.. code-block:: default

    piv = df[['N', 'PARALLEL', 'average']].pivot('N', 'PARALLEL', 'average')
    ax = piv.plot(logy=True)
    ax.set_title("Prediction time with and without parallelization.")

.. image-sg:: /gyexamples/images/sphx_glr_plot_parallelism_004.png
   :alt: Prediction time with and without parallelization.
   :srcset: /gyexamples/images/sphx_glr_plot_parallelism_004.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    somewhere/workspace/mlprodict/mlprodict_UT_39_std/_doc/examples/plot_parallelism.py:338: FutureWarning: In a future version of pandas all arguments of DataFrame.pivot will be keyword-only.
      piv = df[['N', 'PARALLEL', 'average']].pivot('N', 'PARALLEL', 'average')
    Text(0.5, 1.0, 'Prediction time with and without parallelization.')

.. GENERATED FROM PYTHON SOURCE LINES 343-344

Parallelization is working.

.. GENERATED FROM PYTHON SOURCE LINES 344-347

.. code-block:: default

    plt.show()

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 44 minutes 13.317 seconds)

.. _sphx_glr_download_gyexamples_plot_parallelism.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_parallelism.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_parallelism.ipynb `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_