.. _onnxprofileortrst:

==========================
Profiling with onnxruntime
==========================

.. only:: html

    **Links:** :download:`notebook `, :downloadlink:`html `,
    :download:`PDF `, :download:`python `, :downloadlink:`slides `,
    :githublink:`GitHub|_doc/notebooks/onnx_profile_ort.ipynb|*`

The notebook profiles the execution of an ONNX graph built from a
*KMeans* model and executed with *onnxruntime*. It then studies the
decomposition of one einsum equation into simpler operators.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

.. code:: ipython3

    %matplotlib inline

.. code:: ipython3

    %load_ext mlprodict

KMeans
------

Builds a KMeans
~~~~~~~~~~~~~~~

.. code:: ipython3

    from sklearn.datasets import make_classification
    X, y = make_classification(100000)

.. code:: ipython3

    from sklearn.cluster import KMeans
    km = KMeans(max_iter=10)
    km.fit(X)

.. parsed-literal::

    KMeans(max_iter=10)

.. code:: ipython3

    import numpy
    from mlprodict.onnx_conv import to_onnx

    onx = to_onnx(km, X[:1].astype(numpy.float32))

.. code:: ipython3

    %onnxview onx
Json
~~~~

Another way to look into a model.

.. code:: ipython3

    from mlprodict.onnxrt import OnnxInference

    oinf = OnnxInference(onx)
    js = oinf.to_json()

.. code:: ipython3

    import json
    from io import StringIO
    from jyquickhelper import JSONJS
    JSONJS(json.load(StringIO(oinf.to_json())))
Profiling
~~~~~~~~~

.. code:: ipython3

    from mlprodict.onnxrt import OnnxInference

    oinf = OnnxInference(onx, runtime="onnxruntime1",
                         runtime_options={"enable_profiling": True})

.. code:: ipython3

    for i in range(0, 111):
        oinf.run({"X": X.astype(numpy.float32)})

.. code:: ipython3

    df = oinf.get_profiling(as_df=True)
    df
.. parsed-literal::

              cat     pid     tid    dur       ts ph                             name     args_op_name         args_provider args_graph_index args_parameter_size                       args_thread_scheduling_stats args_exec_plan_index args_activation_size args_output_size
    0     Session  106368  299276    596       12  X              model_loading_array              NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    1     Session  106368  299276   6925      670  X           session_initialization              NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    2        Node  106368  299276      1    34854  X  Re_ReduceSumSquare_fence_before  ReduceSumSquare                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    3        Node  106368  299276   2939    34869  X   Re_ReduceSumSquare_kernel_time  ReduceSumSquare  CPUExecutionProvider                0                   0  {'main_thread': {'thread_pool_name': 'session-...                    0              8000000           400000
    4        Node  106368  299276      0    37872  X   Re_ReduceSumSquare_fence_after  ReduceSumSquare                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    ...       ...     ...     ...    ...      ... ..                              ...              ...                   ...              ...                 ...                                                ...                  ...                  ...              ...
    2550     Node  106368  299276      0  2394227  X           Ar_ArgMin_fence_before           ArgMin                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    2551     Node  106368  299276   3511  2394228  X            Ar_ArgMin_kernel_time           ArgMin  CPUExecutionProvider                5                   0  {'main_thread': {'thread_pool_name': 'session-...                    5              3200000           800000
    2552     Node  106368  299276      0  2397761  X            Ar_ArgMin_fence_after           ArgMin                   NaN                   NaN              NaN                 NaN                            NaN                  NaN                  NaN
    2553  Session  106368  299276  11774  2385990  X      SequentialExecutor::Execute              NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    2554  Session  106368  299276  11789  2385982  X                        model_run              NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN

    [2555 rows x 15 columns]
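Every operator shows up three times in this profile: a ``fence_before``
event, the actual ``kernel_time`` event and a ``fence_after`` event, and
only the second one carries a meaningful duration. The aggregation below
sums all three kinds together; fences last 0 or 1 microsecond so the
durations are barely affected, but the occurrence counts are tripled. A
minimal sketch to restrict the profile to kernel executions (an
illustration added here, not part of the original run):

.. code:: ipython3

    # Fence events last 0 or 1 microsecond and triple the row counts:
    # keep only the rows measuring actual kernel execution.
    kernels = df[df['name'].str.endswith('_kernel_time')]
    kernels.groupby('args_op_name')['dur'].sum().sort_values()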
.. code:: ipython3

    import matplotlib.pyplot as plt

    gr_dur = df[['dur', "args_op_name"]].groupby("args_op_name").sum().sort_values('dur')
    gr_n = df[['dur', "args_op_name"]].groupby("args_op_name").count().sort_values('dur')
    gr_n = gr_n.loc[gr_dur.index, :]

    fig, ax = plt.subplots(1, 2, figsize=(8, 4))
    gr_dur.plot.barh(ax=ax[0])
    gr_n.plot.barh(ax=ax[1])
    ax[0].set_title("duration")
    ax[1].set_title("n occurrences");

.. image:: onnx_profile_ort_17_0.png

.. code:: ipython3

    gr2 = df.loc[(df.args_op_name == 'Add') & (df.dur > 10),
                 ['dur', "name"]].groupby("name").sum().sort_values('dur')
    gr2.plot.barh(figsize=(10, 4));

.. image:: onnx_profile_ort_18_0.png

onnxruntime
~~~~~~~~~~~

.. code:: ipython3

    from onnxruntime import InferenceSession, RunOptions, SessionOptions

    so = SessionOptions()
    so.enable_profiling = True
    sess = InferenceSession(onx.SerializeToString(), so)

.. code:: ipython3

    for i in range(0, 111):
        sess.run(None, {'X': X.astype(numpy.float32)})

.. code:: ipython3

    prof = sess.end_profiling()
    prof

.. parsed-literal::

    'onnxruntime_profile__2021-05-13_13-58-59.json'

.. code:: ipython3

    with open(prof, "r") as f:
        js = json.load(f)
    js[:3]

.. parsed-literal::

    [{'cat': 'Session',
      'pid': 106368,
      'tid': 299276,
      'dur': 450,
      'ts': 6,
      'ph': 'X',
      'name': 'model_loading_array',
      'args': {}},
     {'cat': 'Session',
      'pid': 106368,
      'tid': 299276,
      'dur': 3068,
      'ts': 498,
      'ph': 'X',
      'name': 'session_initialization',
      'args': {}},
     {'cat': 'Node',
      'pid': 106368,
      'tid': 299276,
      'dur': 1,
      'ts': 39069,
      'ph': 'X',
      'name': 'Re_ReduceSumSquare_fence_before',
      'args': {'op_name': 'ReduceSumSquare'}}]

.. code:: ipython3

    from pandas import DataFrame
    from mlprodict.onnxrt.ops_whole.session import OnnxWholeSession

    df = DataFrame(OnnxWholeSession.process_profiling(js))
    df
.. parsed-literal::

              cat     pid     tid    dur       ts ph                             name     args_op_name         args_provider args_graph_index args_parameter_size                       args_thread_scheduling_stats args_exec_plan_index args_activation_size args_output_size
    0     Session  106368  299276    450        6  X              model_loading_array              NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    1     Session  106368  299276   3068      498  X           session_initialization              NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    2        Node  106368  299276      1    39069  X  Re_ReduceSumSquare_fence_before  ReduceSumSquare                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    3        Node  106368  299276   2804    39081  X   Re_ReduceSumSquare_kernel_time  ReduceSumSquare  CPUExecutionProvider                0                   0  {'main_thread': {'thread_pool_name': 'session-...                    0              8000000           400000
    4        Node  106368  299276      0    41947  X   Re_ReduceSumSquare_fence_after  ReduceSumSquare                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    ...       ...     ...     ...    ...      ... ..                              ...              ...                   ...              ...                 ...                                                ...                  ...                  ...              ...
    2550     Node  106368  299276      0  2530548  X           Ar_ArgMin_fence_before           ArgMin                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    2551     Node  106368  299276   3501  2530550  X            Ar_ArgMin_kernel_time           ArgMin  CPUExecutionProvider                5                   0  {'main_thread': {'thread_pool_name': 'session-...                    5              3200000           800000
    2552     Node  106368  299276      0  2534074  X            Ar_ArgMin_fence_after           ArgMin                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    2553  Session  106368  299276  14679  2519397  X      SequentialExecutor::Execute              NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    2554  Session  106368  299276  14701  2519386  X                        model_run              NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN

    [2555 rows x 15 columns]
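For reference, a minimal sketch of what such a flattening can look like
(an assumption about ``process_profiling``, not its actual
implementation): it expands the nested ``args`` dictionary of every
trace event into flat ``args_*`` columns.

.. code:: ipython3

    from pandas import DataFrame

    def flatten_profiling(events):
        # Copy the top-level fields of every event and prefix the
        # entries of the nested 'args' dictionary with 'args_'.
        rows = []
        for event in events:
            row = {k: v for k, v in event.items() if k != 'args'}
            for k, v in (event.get('args') or {}).items():
                row['args_%s' % k] = v
            rows.append(row)
        return rows

    DataFrame(flatten_profiling(js))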
Einsum: ``bsnh,btnh->bnts``
---------------------------

This section looks into the ONNX graph produced by the decomposition of
an einsum equation into simpler ONNX operators (no einsum).

Three implementations
~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    from mlprodict.testing.einsum import einsum as onx_einsum
    from mlprodict.testing.einsum.einsum_fct import _einsum, enumerate_cached_einsum
    from numpy import einsum as np_einsum

First classic numpy.

.. code:: ipython3

    equation = "bsnh,btnh->bnts"

    N = 2
    inputs = [numpy.random.randn(N, N, N, N).astype(numpy.float32),
              numpy.random.randn(N, N, N, N).astype(numpy.float32)]

    np_einsum(equation, *inputs)

.. parsed-literal::

    array([[[[-2.373884  , -0.63942796],
             [ 1.0523143 ,  5.659873  ]],

            [[ 2.589915  , -0.18050319],
             [-0.6200199 ,  3.793615  ]]],


           [[[-0.37409338,  0.19822143],
             [ 1.2049038 ,  3.1882448 ]],

            [[-0.05218329,  0.87404007],
             [ 0.12789296,  1.4745121 ]]]], dtype=float32)

Then einsum executed by *onnxruntime*:

.. code:: ipython3

    onx_einsum(equation, *inputs, runtime='onnxruntime1', optimize=True,
               verbose=1, decompose=False)

.. parsed-literal::

    0.0026 best='sbhn,sthn->shtb': 100%|██████████| 121/121 [00:01<00:00, 85.29it/s]

.. parsed-literal::

    array([[[[-2.373884  , -0.63942796],
             [ 1.0523144 ,  5.659873  ]],

            [[ 2.589915  , -0.18050319],
             [-0.62002003,  3.793615  ]]],


           [[[-0.37409338,  0.19822143],
             [ 1.2049038 ,  3.1882448 ]],

            [[-0.05218329,  0.87404007],
             [ 0.12789296,  1.474512  ]]]], dtype=float32)

.. code:: ipython3

    obj = _einsum(equation, runtime='onnxruntime1', optimize=True,
                  verbose=1, decompose=False, dtype=inputs[0].dtype)

.. code:: ipython3

    %onnxview obj.onnx_
Same equation but decomposed.

.. code:: ipython3

    obj = _einsum(equation, runtime='onnxruntime1', optimize=True,
                  verbose=1, decompose=True, dtype=inputs[0].dtype)

.. parsed-literal::

    0.0025 best='hsnt,hbnt->hnbs': 100%|██████████| 121/121 [00:03<00:00, 34.54it/s]

.. code:: ipython3

    %onnxview obj.onnx_
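As a rough illustration of what the decomposition does, this equation
can be rewritten with two transpositions, one batched matrix
multiplication contracting ``h``, and a final transposition. This is
one possible rewriting, not necessarily the exact graph mlprodict
generates:

.. code:: ipython3

    # bsnh -> bnsh and btnh -> bnht, a batched matmul contracts h
    # into bnst, then the two last axes are swapped to get bnts.
    manual = (inputs[0].transpose(0, 2, 1, 3) @
              inputs[1].transpose(0, 2, 3, 1)).transpose(0, 1, 3, 2)
    assert numpy.allclose(manual, np_einsum(equation, *inputs), atol=1e-5)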
.. code:: ipython3

    onx_einsum(equation, *inputs, runtime='onnxruntime1', optimize=True, verbose=1)

.. parsed-literal::

    array([[[[-2.373884  , -0.63942796],
             [ 1.0523144 ,  5.659873  ]],

            [[ 2.589915  , -0.18050319],
             [-0.62002003,  3.793615  ]]],


           [[[-0.37409338,  0.19822143],
             [ 1.2049038 ,  3.1882448 ]],

            [[-0.05218329,  0.87404007],
             [ 0.12789296,  1.474512  ]]]], dtype=float32)

First benchmark
~~~~~~~~~~~~~~~

.. code:: ipython3

    N = 20
    inputs = [numpy.random.randn(N, N, N, N).astype(numpy.float32),
              numpy.random.randn(N, N, N, N).astype(numpy.float32)]

*numpy.einsum*

.. code:: ipython3

    %timeit numpy.einsum(equation, *inputs)

.. parsed-literal::

    4.14 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

*onnxruntime einsum*

.. code:: ipython3

    %timeit onx_einsum(equation, *inputs, runtime='onnxruntime1', optimize=True, verbose=1, decompose=False)

.. parsed-literal::

    736 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

*onnxruntime decomposed einsum*

.. code:: ipython3

    %timeit onx_einsum(equation, *inputs, runtime='onnxruntime1', optimize=True, verbose=1)

.. parsed-literal::

    525 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Let’s disable the optimization to see the difference. The optimization
goes through all the permutations of the letters of the equation and
compares the computation time to find the best one.

.. code:: ipython3

    %timeit onx_einsum(equation, *inputs, runtime='onnxruntime1', optimize=False, verbose=1, decompose=False)

.. parsed-literal::

    761 µs ± 46.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

It has no significant impact here but it does for the decomposition.
The non-optimized version is much slower.

.. code:: ipython3

    %timeit onx_einsum(equation, *inputs, runtime='onnxruntime1', optimize=False, verbose=1)

.. parsed-literal::

    1.41 ms ± 43.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Profiling of the not optimized version
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let’s profile the graph obtained with the decomposition.

.. code:: ipython3

    obj = _einsum(equation, runtime='onnxruntime1', optimize=False,
                  verbose=1, decompose=True, dtype=inputs[0].dtype)
    onx = obj.onnx_

.. code:: ipython3

    obj.equation, obj.equation_

.. parsed-literal::

    ('bsnh,btnh->bnts', 'bsnh,btnh->bnts')

.. code:: ipython3

    from mlprodict.onnxrt import OnnxInference

    oinf = OnnxInference(onx, runtime="onnxruntime1",
                         runtime_options={"enable_profiling": True})

    d_inputs = {'X0': inputs[0], 'X1': inputs[1]}
    for i in range(0, 100):
        oinf.run(d_inputs)

    df = oinf.get_profiling(as_df=True)
    df.head()
.. parsed-literal::

           cat     pid     tid   dur    ts ph                                    name args_op_name         args_provider args_graph_index args_parameter_size                       args_thread_scheduling_stats args_exec_plan_index args_activation_size args_output_size
    0  Session  106368  299276   705     4  X                     model_loading_array          NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    1  Session  106368  299276  7019   987  X                  session_initialization          NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    2     Node  106368  299276     1  8320  X  Unsqueeze3_2620928306480_fence_before    Unsqueeze                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    3     Node  106368  299276     4  8327  X   Unsqueeze3_2620928306480_kernel_time    Unsqueeze  CPUExecutionProvider                4                   8  {'main_thread': {'thread_pool_name': 'session-...                    4               640000           640000
    4     Node  106368  299276     0  8372  X   Unsqueeze3_2620928306480_fence_after    Unsqueeze                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
.. code:: ipython3

    import matplotlib.pyplot as plt

    gr_dur = df[['dur', "args_op_name"]].groupby("args_op_name").sum().sort_values('dur')
    gr_n = df[['dur', "args_op_name"]].groupby("args_op_name").count().sort_values('dur')
    gr_n = gr_n.loc[gr_dur.index, :]

    fig, ax = plt.subplots(1, 2, figsize=(8, 4))
    gr_dur.plot.barh(ax=ax[0])
    gr_n.plot.barh(ax=ax[1])
    ax[0].set_title("duration - not optimized - %s" % obj.equation_)
    ax[1].set_title("n occurrences");

.. image:: onnx_profile_ort_54_0.png

Profiling of the optimized version
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    obj = _einsum(equation, runtime='onnxruntime1', optimize=True,
                  verbose=1, decompose=True, dtype=inputs[0].dtype)
    onx = obj.onnx_

.. code:: ipython3

    obj.equation, obj.equation_

.. parsed-literal::

    ('bsnh,btnh->bnts', 'hsnt,hbnt->hnbs')

The second equation is the optimized one.

.. code:: ipython3

    from mlprodict.onnxrt import OnnxInference

    oinf = OnnxInference(onx, runtime="onnxruntime1",
                         runtime_options={"enable_profiling": True})

    d_inputs = {'X0': inputs[0], 'X1': inputs[1]}
    for i in range(0, 100):
        oinf.run(d_inputs)

    df = oinf.get_profiling(as_df=True)
    df.head()
.. parsed-literal::

           cat     pid     tid   dur    ts ph                                    name args_op_name         args_provider args_graph_index args_parameter_size                       args_thread_scheduling_stats args_exec_plan_index args_activation_size args_output_size
    0  Session  106368  299276  1300     6  X                     model_loading_array          NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    1  Session  106368  299276  7330  1720  X                  session_initialization          NaN                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    2     Node  106368  299276     1  9376  X  Unsqueeze3_2620928202160_fence_before    Unsqueeze                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
    3     Node  106368  299276     4  9383  X   Unsqueeze3_2620928202160_kernel_time    Unsqueeze  CPUExecutionProvider                4                   8  {'main_thread': {'thread_pool_name': 'session-...                    4               640000           640000
    4     Node  106368  299276     0  9422  X   Unsqueeze3_2620928202160_fence_after    Unsqueeze                   NaN              NaN                 NaN                                                NaN                  NaN                  NaN              NaN
.. code:: ipython3

    gr_dur = df[['dur', "args_op_name"]].groupby("args_op_name").sum().sort_values('dur')
    gr_n = df[['dur', "args_op_name"]].groupby("args_op_name").count().sort_values('dur')
    gr_n = gr_n.loc[gr_dur.index, :]

    fig, ax = plt.subplots(1, 2, figsize=(8, 4))
    gr_dur.plot.barh(ax=ax[0])
    gr_n.plot.barh(ax=ax[1])
    ax[0].set_title("duration - optimized - %s" % obj.equation_)
    ax[1].set_title("n occurrences");

.. image:: onnx_profile_ort_60_0.png

*onnxruntime* was able to fuse *MatMul* with a transposition. That
explains why it is faster.

.. code:: ipython3

    gr_dur = df[['dur', "args_op_name", "name"]].groupby(
        ["args_op_name", "name"], as_index=False).sum().sort_values('dur')
    gr_dur
.. parsed-literal::

       args_op_name                                        name    dur
    0        Concat                        Concat12_fence_after      0
    24       Gather                         Gather1_fence_after      0
    25       Gather                        Gather1_fence_before      0
    27       Gather                          Gather_fence_after      0
    60    Transpose   Transpose02134_2620928192768_fence_after      0
    ..          ...                                         ...    ...
    56      Squeeze          Squeeze4_2620928194352_kernel_time   4339
    59    Transpose   Transpose01324_2620928151024_kernel_time   8661
    62    Transpose   Transpose02134_2620928192768_kernel_time  11487
    65    Transpose   Transpose13024_2620928192816_kernel_time  13598
    11  FusedMatMul           MatMul_With_Transpose_kernel_time  23847

    [72 rows x 3 columns]
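To put a number on the fusion, a small aggregation of the share each
operator takes in the total kernel time (an illustration added here,
not part of the original run):

.. code:: ipython3

    # Restrict to kernel events and compute each operator's share
    # of the total kernel time.
    kernel = gr_dur[gr_dur['name'].str.endswith('_kernel_time')]
    share = kernel.groupby('args_op_name')['dur'].sum()
    (share / share.sum()).sort_values(ascending=False)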
.. code:: ipython3

    gr_dur[gr_dur.args_op_name == "Transpose"]
.. parsed-literal::

       args_op_name                                        name    dur
    60    Transpose   Transpose02134_2620928192768_fence_after      0
    57    Transpose   Transpose01324_2620928151024_fence_after      0
    61    Transpose  Transpose02134_2620928192768_fence_before      0
    58    Transpose  Transpose01324_2620928151024_fence_before      1
    64    Transpose  Transpose13024_2620928192816_fence_before      1
    63    Transpose   Transpose13024_2620928192816_fence_after      3
    59    Transpose   Transpose01324_2620928151024_kernel_time   8661
    62    Transpose   Transpose02134_2620928192768_kernel_time  11487
    65    Transpose   Transpose13024_2620928192816_kernel_time  13598
Let’s draw the graph again to see which transpose is which.

.. code:: ipython3

    %onnxview onx
The optimizer looked into all the permutations. We see that the letter
ordering should be carefully chosen.

.. code:: ipython3

    import pandas
    df = pandas.DataFrame(obj.timed_permutations_, columns=["time", "equation"])
    df = df.sort_values('time')
    df = df.set_index("equation")
    ax = df.plot.barh(figsize=(8, 25))
    ax.set_title("%s OPTIMIZED INTO %s" % (obj.equation, obj.equation_));

.. image:: onnx_profile_ort_67_0.png
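The search space itself is easy to enumerate: the equation uses five
distinct letters, so there are 5! = 120 possible renamings, which is
consistent with the 121 steps of the progress bars above (presumably
the initial equation plus every permutation). A small sketch of the
idea, not mlprodict's actual code:

.. code:: ipython3

    from itertools import permutations

    # Rename the letters of the equation in every possible way;
    # each renaming is an equivalent equation once the inputs are
    # transposed accordingly.
    letters = sorted(set(c for c in equation if c.isalpha()))  # b, h, n, s, t
    candidates = set()
    for perm in permutations(letters):
        mapping = dict(zip(letters, perm))
        candidates.add("".join(mapping.get(c, c) for c in equation))
    len(letters), len(candidates)  # (5, 120)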