Time processing for every ONNX nodes in a graph#

Links: notebook, html, PDF, python, slides, GitHub

The following notebook show how long the runtime spends in each node of an ONNX graph.

from jyquickhelper import add_notebook_menu
add_notebook_menu()

%load_ext mlprodict

%matplotlib inline

LogisticRegression #

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = LogisticRegression(solver='liblinear')
clr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

import numpy
from mlprodict.onnx_conv import to_onnx
onx = to_onnx(clr, X_test.astype(numpy.float32))
with open("logreg_time.onnx", "wb") as f:
    f.write(onx.SerializeToString())
# add -l 1 if nothing shows up
%onnxview onx

from mlprodict.onnxrt import OnnxInference
import pandas
oinf = OnnxInference(onx)
res = oinf.run({'X': X_test}, node_time=True)
pandas.DataFrame(list(res[1]))

	i	name	op_type	time
0	0	LinearClassifier	LinearClassifier	0.199603
1	1	Normalizer	Normalizer	0.000091
2	2	Cast	Cast	0.000014
3	3	ZipMap	ZipMap	0.000016

oinf.run({'X': X_test})['output_probability'][:5]

{0: array([8.38235830e-01, 1.21554664e-03, 6.97352537e-04, 7.93823160e-01,
        9.24825077e-01]),
 1: array([0.16162989, 0.39692812, 0.25688601, 0.20607722, 0.07516498]),
 2: array([1.34279470e-04, 6.01856333e-01, 7.42416637e-01, 9.96200831e-05,
        9.94208860e-06])}

Measure time spent in each node #

With parameter node_time=True, method run returns the output and time measurement.

exe = oinf.run({'X': X_test}, node_time=True)
exe[1]

[{'i': 0,
  'name': 'LinearClassifier',
  'op_type': 'LinearClassifier',
  'time': 0.00015699999999974068},
 {'i': 1,
  'name': 'Normalizer',
  'op_type': 'Normalizer',
  'time': 5.43000000003957e-05},
 {'i': 2, 'name': 'Cast', 'op_type': 'Cast', 'time': 1.1699999999947863e-05},
 {'i': 3,
  'name': 'ZipMap',
  'op_type': 'ZipMap',
  'time': 1.940000000111297e-05}]

import pandas
pandas.DataFrame(exe[1])

	i	name	op_type	time
0	0	LinearClassifier	LinearClassifier	0.000157
1	1	Normalizer	Normalizer	0.000054
2	2	Cast	Cast	0.000012
3	3	ZipMap	ZipMap	0.000019

Logistic regression: python runtime vs onnxruntime #

Function enumerate_validated_operator_opsets implements automated tests for every model with artificial data. Option node_time automatically returns the time spent in each node and does it multiple time.

from mlprodict.onnxrt.validate import enumerate_validated_operator_opsets
res = list(enumerate_validated_operator_opsets(
            verbose=0, models={"LogisticRegression"}, opset_min=12,
            runtime='python', debug=False, node_time=True,
            filter_exp=lambda m, p: p == "b-cl"))

C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))

import pandas
df = pandas.DataFrame(res[0]['bench-batch'])
df['step'] = df.apply(lambda row: '{}-{}'.format(row['i'], row["name"]), axis=1)
df

	i	name	op_type	time	N	max_time	min_time	repeat	number	step
0	0	LinearClassifier	LinearClassifier	0.000018	1	0.000033	0.000015	20	30	0-LinearClassifier
1	1	Normalizer	Normalizer	0.000017	1	0.000069	0.000012	20	30	1-Normalizer
2	2	Cast	Cast	0.000004	1	0.000009	0.000003	20	30	2-Cast
3	3	ZipMap	ZipMap	0.000005	1	0.000007	0.000004	20	30	3-ZipMap
4	0	LinearClassifier	LinearClassifier	0.000020	10	0.000052	0.000017	20	20	0-LinearClassifier
5	1	Normalizer	Normalizer	0.000015	10	0.000035	0.000013	20	20	1-Normalizer
6	2	Cast	Cast	0.000004	10	0.000013	0.000003	20	20	2-Cast
7	3	ZipMap	ZipMap	0.000004	10	0.000008	0.000004	20	20	3-ZipMap
8	0	LinearClassifier	LinearClassifier	0.000024	100	0.000036	0.000019	10	8	0-LinearClassifier
9	1	Normalizer	Normalizer	0.000018	100	0.000023	0.000015	10	8	1-Normalizer
10	2	Cast	Cast	0.000004	100	0.000006	0.000003	10	8	2-Cast
11	3	ZipMap	ZipMap	0.000007	100	0.000005	0.000004	10	8	3-ZipMap
12	0	LinearClassifier	LinearClassifier	0.000051	1000	0.000057	0.000047	5	5	0-LinearClassifier
13	1	Normalizer	Normalizer	0.000041	1000	0.000045	0.000040	5	5	1-Normalizer
14	2	Cast	Cast	0.000003	1000	0.000004	0.000003	5	5	2-Cast
15	3	ZipMap	ZipMap	0.000004	1000	0.000004	0.000004	5	5	3-ZipMap
16	0	LinearClassifier	LinearClassifier	0.000315	10000	0.000328	0.000315	3	3	0-LinearClassifier
17	1	Normalizer	Normalizer	0.000272	10000	0.000284	0.000256	3	3	1-Normalizer
18	2	Cast	Cast	0.000004	10000	0.000004	0.000004	3	3	2-Cast
19	3	ZipMap	ZipMap	0.000004	10000	0.000004	0.000004	3	3	3-ZipMap
20	0	LinearClassifier	LinearClassifier	0.005634	100000	0.005634	0.005634	1	2	0-LinearClassifier
21	1	Normalizer	Normalizer	0.004671	100000	0.004671	0.004671	1	2	1-Normalizer
22	2	Cast	Cast	0.000024	100000	0.000024	0.000024	1	2	2-Cast
23	3	ZipMap	ZipMap	0.000013	100000	0.000013	0.000013	1	2	3-ZipMap

Following tables shows the time spent in each node, it is relative to the total time. For one observation, the runtime spends 10% of the time in ZipMap, it is only 1% or 2% with 10 observations. These proportions change due to the computing cost of each node.

piv = df.pivot('step', 'N', 'time')
total = piv.sum(axis=0)
piv / total

N	1	10	100	1000	10000	100000
step
0-LinearClassifier	0.410138	0.459103	0.450882	0.512622	0.530490	0.544785
1-Normalizer	0.390060	0.353622	0.350126	0.414227	0.456671	0.451642
2-Cast	0.095729	0.089857	0.074343	0.034398	0.006092	0.002306
3-ZipMap	0.104073	0.097418	0.124649	0.038753	0.006747	0.001267

The python implementation of ZipMap does not change the data but wraps in into a frozen class ArrayZipMapDitionary which mocks a list of dictionaries pandas can ingest to create a DataFrame. The cost is a fixed cost and does not depend on the number of processed rows.

from pyquickhelper.pycode.profiling import profile
bigX = numpy.random.randn(100000, X_test.shape[1]).astype(numpy.float32)
print(profile(lambda: oinf.run({'X': bigX}), pyinst_format="text")[1])

  _     ._   __/__   _ _  _  _ _/_   Recorded: 00:28:08  Samples:  4
 /_//_/// /_/ //_// / //_'/ //     Duration: 0.009     CPU time: 0.031
/   _/                      v3.0.1
Program: c:python372_x64libsite-packagesipykernel_launcher.py -f C:UsersxavieAppDataRoamingjupyterruntimekernel-287476aa-b8ba-4140-902a-b0aad833ffd0.json
0.008 profile  pyquickhelperpycodeprofiling.py:49
`- 0.008 <lambda>  <ipython-input-13-ccd42692a7ed>:3
   `- 0.008 run  mlprodictonnxrtonnx_inference.py:475
      `- 0.008 _run_sequence_runtime  mlprodictonnxrtonnx_inference.py:558
         `- 0.008 run  mlprodictonnxrtonnx_inference_node.py:141
            |- 0.005 run  mlprodictonnxrtops_cpu_op.py:417
            |  `- 0.005 run  mlprodictonnxrtops_cpu_op.py:298
            |     `- 0.005 _run  mlprodictonnxrtops_cpuop_linear_classifier.py:40
            |        |- 0.003 [self]
            |        `- 0.002 argmax  <__array_function__ internals>:2
            |           `- 0.002 argmax  numpycorefromnumeric.py:1112
            |                 [3 frames hidden]  numpy
            |                    0.002 _wrapfunc  numpycorefromnumeric.py:55
            `- 0.003 run  mlprodictonnxrtops_cpu_op.py:383
               `- 0.003 run  mlprodictonnxrtops_cpu_op.py:298
                  `- 0.003 _run  mlprodictonnxrtops_cpuop_normalizer.py:66
                     `- 0.003 norm_l1  mlprodictonnxrtops_cpuop_normalizer.py:42
                        `- 0.003 _norm_L1_inplace  mlprodictonnxrtops_cpuop_normalizer.py:49
                           |- 0.002 [self]
                           `- 0.002 _sum  numpycore_methods.py:36
                                 [2 frames hidden]  numpy

The class ArrayZipMapDictionary is fast to build but has an overhead after that because it builds data when needed.

res = oinf.run({'X': bigX})
prob = res['output_probability']
type(prob)

mlprodict.onnxrt.ops_cpu.op_zipmap.ArrayZipMapDictionary

%timeit pandas.DataFrame(prob)

721 ms ± 54.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

list_of_dict = [v.asdict() for v in prob]
%timeit pandas.DataFrame(list_of_dict)

108 ms ± 2.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

But if you just need to do the following:

%timeit pandas.DataFrame(prob).values

713 ms ± 56.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Then, you can just do that:

print(prob.columns)
%timeit prob.values

[0, 1, 2]
390 ns ± 51.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

And then:

%timeit -n 100 pandas.DataFrame(prob.values, columns=prob.columns)

215 µs ± 82.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can then compare to what onnxruntime would do when the runtime is called indenpently for each node. We use the runtime named onnxruntime2. Class OnnxInference splits the ONNX graph into multiple ONNX graphs, one for each node, and then calls onnxruntime for each of them indenpently. Python handles the graph logic.

res = list(enumerate_validated_operator_opsets(
            verbose=0, models={"LogisticRegression"}, opset_min=12,
            runtime='onnxruntime2', debug=False, node_time=True))

C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))

res0 = None
for i, r in enumerate(res):
    if "available-ERROR" in r:
        print(i, str(r['available-ERROR']).split("\n")[0])
    elif res0 is None:
        res0 = r

Unable to load node 'ZipMap' (output type was inferred)
Unable to load node 'ZipMap' (output type was inferred)
Unable to load node 'LinearClassifier' (output type was guessed)
Unable to load node 'LinearClassifier' (output type was guessed)
Unable to load node 'LinearClassifier' (output type was guessed)
Unable to load node 'LinearClassifier' (output type was guessed)
Unable to load node 'ZipMap' (output type was inferred)
Unable to load node 'ZipMap' (output type was inferred)

if '_6ort_run_batch_exc' in res[0]:
    m = "Something went wrong.", res[0]['_6ort_run_batch_exc']
else:
    df = pandas.DataFrame(res0['bench-batch'])
    print(df)
    df['step'] = df.apply(lambda row: '{}-{}'.format(row['i'], row["name"]), axis=1)
    piv = df.pivot('step', 'N', 'time')
    total = piv.sum(axis=0)
    m = piv / total
m

   i              name           op_type      time       N  max_time  0  0  LinearClassifier  LinearClassifier  0.000052       1  0.000190
1  0  LinearClassifier  LinearClassifier  0.000044      10  0.000070
2  0  LinearClassifier  LinearClassifier  0.000071     100  0.000133
3  0  LinearClassifier  LinearClassifier  0.000066    1000  0.000079
4  0  LinearClassifier  LinearClassifier  0.000409   10000  0.000408
5  0  LinearClassifier  LinearClassifier  0.003275  100000  0.003275
   min_time  repeat  number
0  0.000028      20      30
1  0.000031      20      20
2  0.000046      10       8
3  0.000057       5       5
4  0.000365       3       3
5  0.003275       1       2

N	1	10	100	1000	10000	100000
step
0-LinearClassifier	1.0	1.0	1.0	1.0	1.0	1.0

onnxruntime creates a new container each time a ZipMap is executed. That’s whay it takes that much time and the ratio increases when the number of observations increases.

GaussianProcessRegressor #

This operator is slow for small batches compare to scikit-learn but closes the gap as the batch size increases. Let’s see where the time goes.

from onnx.defs import onnx_opset_version
from mlprodict.tools.asv_options_helper import get_opset_number_from_onnx
onnx_opset_version(), get_opset_number_from_onnx()

(12, 12)

res = list(enumerate_validated_operator_opsets(
            verbose=1, models={"GaussianProcessRegressor"},
            opset_min=get_opset_number_from_onnx(),
            opset_max=get_opset_number_from_onnx(),
            runtime='python', debug=False, node_time=True,
            filter_exp=lambda m, p: p == "b-reg"))

[enumerate_validated_operator_opsets] opset in [12, 12].

GaussianProcessRegressor    :   0%|          | 0/1 [00:00<?, ?it/s]

[enumerate_compatible_opset] opset in [12, 12].

GaussianProcessRegressor    : 100%|██████████| 1/1 [00:05<00:00,  5.66s/it]

res0 = None
for i, r in enumerate(res):
    if "available-ERROR" in r:
        print(i, str(r['available-ERROR']).split("\n")[0])
    elif res0 is None:
        res0 = r

df = pandas.DataFrame(res0['bench-batch'])
df['step'] = df.apply(lambda row: '{0:02d}-{1}'.format(row['i'], row["name"]), axis=1)
df.head()

	i	name	op_type	time	N	max_time	min_time	repeat	number	step
0	0	kgpd_CDist	CDist	0.000033	1	0.000045	0.000027	20	30	00-kgpd_CDist
1	1	kgpd_Div	Div	0.000009	1	0.000016	0.000007	20	30	01-kgpd_Div
2	2	kgpd_Mul	Mul	0.000006	1	0.000007	0.000005	20	30	02-kgpd_Mul
3	3	kgpd_Sin	Sin	0.000007	1	0.000009	0.000006	20	30	03-kgpd_Sin
4	4	kgpd_Div1	Div	0.000007	1	0.000008	0.000005	20	30	04-kgpd_Div1

pivpy = df.pivot('step', 'N', 'time')
total = pivpy.sum(axis=0)
pivpy / total

N	1	10	100	1000	10000	100000
step
00-kgpd_CDist	0.311496	0.300665	0.244035	0.227984	0.264447	0.288546
01-kgpd_Div	0.082535	0.067193	0.028348	0.011667	0.012230	0.015447
02-kgpd_Mul	0.059840	0.050664	0.018670	0.006959	0.010950	0.012468
03-kgpd_Sin	0.067037	0.086529	0.106165	0.113068	0.102102	0.107563
04-kgpd_Div1	0.061852	0.053088	0.025749	0.010935	0.009810	0.009875
05-kgpd_Pow	0.072520	0.166539	0.361318	0.438253	0.418169	0.404182
06-kgpd_Mul1	0.057508	0.050477	0.020386	0.010334	0.009079	0.010466
07-kgpd_Exp	0.067885	0.098850	0.150876	0.168079	0.165177	0.145106
08-gpr_MatMul	0.137546	0.064570	0.025069	0.009029	0.007134	0.006159
09-gpr_Add	0.081782	0.061424	0.019383	0.003692	0.000903	0.000190

ax = (pivpy / total).T.plot(logx=True, figsize=(14, 4))
ax.set_ylim([0,1])
ax.set_title("Time spent in each node relatively to the total time\npython runtime");

The operator Scan is clearly time consuming when the batch size is small. onnxruntime is more efficient for this one.

res = list(enumerate_validated_operator_opsets(
            verbose=1, models={"GaussianProcessRegressor"},
            opset_min=get_opset_number_from_onnx(),
            opset_max=get_opset_number_from_onnx(),
            runtime='onnxruntime2', debug=False, node_time=True,
            filter_exp=lambda m, p: p == "b-reg"))

[enumerate_validated_operator_opsets] opset in [12, 12].

GaussianProcessRegressor    :   0%|          | 0/1 [00:00<?, ?it/s]

[enumerate_compatible_opset] opset in [12, 12].

GaussianProcessRegressor    : 100%|██████████| 1/1 [00:06<00:00,  6.84s/it]

try:
    df = pandas.DataFrame(res[0]['bench-batch'])
except KeyError as e:
    print("No model available.")
    r, df = None, None
if df is not None:
    df['step'] = df.apply(lambda row: '{0:02d}-{1}'.format(row['i'], row["name"]), axis=1)
    pivort = df.pivot('step', 'N', 'time')
    total = pivort.sum(axis=0)
    r = pivort / total
r

N	1	10	100	1000	10000	100000
step
00-kgpd_CDist	0.114001	0.120884	0.128845	0.148042	0.156776	0.180377
01-kgpd_Div	0.101792	0.098622	0.085983	0.085108	0.086603	0.084520
02-kgpd_Mul	0.099980	0.097547	0.084001	0.064706	0.072849	0.081023
03-kgpd_Sin	0.089632	0.103505	0.194194	0.301002	0.245769	0.260717
04-kgpd_Div1	0.099119	0.096737	0.088709	0.063237	0.095840	0.091635
05-kgpd_Pow	0.108045	0.098307	0.081161	0.064898	0.079015	0.076962
06-kgpd_Mul1	0.098561	0.098475	0.082770	0.063557	0.087732	0.076762
07-kgpd_Exp	0.090019	0.087015	0.086282	0.088542	0.103798	0.087690
08-gpr_MatMul	0.100426	0.102766	0.106220	0.102751	0.069157	0.059617
09-gpr_Add	0.098425	0.096143	0.061836	0.018155	0.002462	0.000696

if r is not None:
    ax = (pivort / total).T.plot(logx=True, figsize=(14, 4))
    ax.set_ylim([0,1])
    ax.set_title("Time spent in each node relatively to the total time\nonnxtunime");

The results are relative. Let’s see which runtime is best node by node.

if r is not None:
    r = (pivort - pivpy) / pivpy
r

N	1	10	100	1000	10000	100000
step
00-kgpd_CDist	-0.239113	-0.367743	-0.630420	-0.703226	-0.677041	-0.631775
01-kgpd_Div	1.564119	1.308106	1.123155	2.333824	2.857590	2.223117
02-kgpd_Mul	2.473648	2.027773	2.149388	3.249448	2.624165	2.828025
03-kgpd_Sin	1.779780	0.881090	0.280401	0.216680	0.311288	0.427768
04-kgpd_Div1	2.331710	1.865534	1.411570	1.642922	4.321977	4.466249
05-kgpd_Pow	2.097502	-0.071724	-0.842765	-0.932321	-0.897065	-0.887837
06-kgpd_Mul1	2.563218	2.067909	1.842112	1.811019	4.263897	3.320524
07-kgpd_Exp	1.756911	0.384288	-0.599693	-0.759241	-0.657668	-0.644029
08-gpr_MatMul	0.517953	1.502798	1.965953	4.201281	4.281286	4.701255
09-gpr_Add	1.502111	1.461452	1.233097	1.247736	0.486358	1.160395

Based on this, onnxruntime is faster for operators Scan, Pow, Exp and slower for all the others.

Measuring the time with a custom dataset #

We use the example Comparison of kernel ridge and Gaussian process regression.

import numpy
import pandas
import matplotlib.pyplot as plt
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, ExpSineSquared

rng = numpy.random.RandomState(0)

# Generate sample data
X = 15 * rng.rand(100, 1)
y = numpy.sin(X).ravel()
y += 3 * (0.5 - rng.rand(X.shape[0]))  # add noise

gp_kernel = ExpSineSquared(1.0, 5.0, periodicity_bounds=(1e-2, 1e1))
gpr = GaussianProcessRegressor(kernel=gp_kernel)
gpr.fit(X, y)

C:xavierdupre__home_github_forkscikit-learnsklearngaussian_processkernels.py:409: ConvergenceWarning: The optimal value found for dimension 0 of parameter length_scale is close to the specified lower bound 1e-05. Decreasing the bound and calling fit again may find a better value.
  ConvergenceWarning)
C:xavierdupre__home_github_forkscikit-learnsklearngaussian_processkernels.py:418: ConvergenceWarning: The optimal value found for dimension 0 of parameter periodicity is close to the specified upper bound 10.0. Increasing the bound and calling fit again may find a better value.
  ConvergenceWarning)

GaussianProcessRegressor(kernel=ExpSineSquared(length_scale=1, periodicity=5))

onx = to_onnx(gpr, X_test.astype(numpy.float64))
with open("gpr_time.onnx", "wb") as f:
    f.write(onx.SerializeToString())
%onnxview onx -r 1

from mlprodict.tools import get_ir_version_from_onnx
onx.ir_version = get_ir_version_from_onnx()

oinfpy = OnnxInference(onx, runtime="python")
oinfort = OnnxInference(onx, runtime="onnxruntime2")

runtime==onnxruntime2 tells the class OnnxInference to use onnxruntime for every node independently, there are as many calls as there are nodes in the graph.

respy = oinfpy.run({'X': X_test}, node_time=True)
try:
    resort = oinfort.run({'X': X_test}, node_time=True)
except Exception as e:
    print(e)
    resort = None

if resort is not None:
    df = pandas.DataFrame(respy[1]).merge(pandas.DataFrame(resort[1]), on=["i", "name", "op_type"],
                                        suffixes=("_py", "_ort"))
    df['delta'] = df.time_ort - df.time_py
else:
    df = None
df

	i	name	op_type	time_py	time_ort	delta
0	0	Sc_Scan	Scan	0.007998	0.005970	-0.002028
1	1	kgpd_Transpose	Transpose	0.000032	0.000599	0.000567
2	2	kgpd_Sqrt	Sqrt	0.000063	0.000112	0.000049
3	3	kgpd_Div	Div	0.000143	0.000097	-0.000045
4	4	kgpd_Mul	Mul	0.000038	0.000321	0.000283
5	5	kgpd_Sin	Sin	0.000095	0.000146	0.000051
6	6	kgpd_Div1	Div	0.000027	0.000096	0.000069
7	7	kgpd_Pow	Pow	0.000299	0.000104	-0.000196
8	8	kgpd_Mul1	Mul	0.000032	0.000097	0.000065
9	9	kgpd_Exp	Exp	0.000383	0.000111	-0.000271
10	10	gpr_MatMul	MatMul	0.000080	0.004359	0.004279
11	11	gpr_Add	Add	0.000034	0.000165	0.000131

The following function runs multiple the same inference and aggregates the results node by node.

from mlprodict.onnxrt.validate.validate import benchmark_fct
res = benchmark_fct(lambda X: oinfpy.run({'X': X_test}, node_time=True),
                    X_test, node_time=True)

df = pandas.DataFrame(res)
df[df.N == 100]

	i	name	op_type	time	N	max_time	min_time	repeat	number
24	0	Sc_Scan	Scan	0.004154	100	0.004330	0.003843	10	8
25	1	kgpd_Transpose	Transpose	0.000013	100	0.000019	0.000010	10	8
26	2	kgpd_Sqrt	Sqrt	0.000018	100	0.000022	0.000015	10	8
27	3	kgpd_Div	Div	0.000025	100	0.000092	0.000015	10	8
28	4	kgpd_Mul	Mul	0.000012	100	0.000019	0.000009	10	8
29	5	kgpd_Sin	Sin	0.000057	100	0.000070	0.000050	10	8
30	6	kgpd_Div1	Div	0.000014	100	0.000017	0.000011	10	8
31	7	kgpd_Pow	Pow	0.000172	100	0.000198	0.000155	10	8
32	8	kgpd_Mul1	Mul	0.000020	100	0.000101	0.000009	10	8
33	9	kgpd_Exp	Exp	0.000213	100	0.000249	0.000193	10	8
34	10	gpr_MatMul	MatMul	0.000034	100	0.000047	0.000026	10	8
35	11	gpr_Add	Add	0.000013	100	0.000019	0.000011	10	8

df100 = df[df.N == 100]

%matplotlib inline

fig, ax = plt.subplots(1, 1, figsize=(14, 4))
ax.bar(df100.i, df100.time, align='center', color='orange')
ax.set_xticks(df100.i)
ax.set_yscale('log')
ax.set_xticklabels(df100.op_type)
ax.errorbar(df100.i, df100.time,
            numpy.abs(df100[["min_time", "max_time"]].T.values - df100.time.values.ravel()),
            uplims=True, lolims=True, color='blue')
ax.set_title("Time spent in each node for 100 observations\nGaussianProcess");

df100c = df100.cumsum()

fig, ax = plt.subplots(1, 1, figsize=(14, 4))
ax.bar(df100.i, df100c.time, align='center', color='orange')
ax.set_xticks(df100.i)
#ax.set_yscale('log')
ax.set_ylim([df100c.min_time.min(), df100c.max_time.max()])
ax.set_xticklabels(df100.op_type)
ax.errorbar(df100.i, df100c.time,
            numpy.abs((df100c[["min_time", "max_time"]].T.values - df100c.time.values.ravel())),
            uplims=True, lolims=True)
ax.set_title("Cumulated time spent in each node for 100 observations\nGaussianProcess");

onnxruntime2 / onnxruntime1 #

The runtime onnxruntime1 uses onnxruntime for the whole ONNX graph. There is no way to get the computation time for each node except if we create a ONNX graph for each intermediate node.

oinfort1 = OnnxInference(onx, runtime='onnxruntime1')

split = oinfort1.build_intermediate()
split

OrderedDict([('scan0', OnnxInference(...)),
             ('scan1', OnnxInference(...)),
             ('kgpd_transposed0', OnnxInference(...)),
             ('kgpd_Y0', OnnxInference(...)),
             ('kgpd_C03', OnnxInference(...)),
             ('kgpd_C02', OnnxInference(...)),
             ('kgpd_output02', OnnxInference(...)),
             ('kgpd_C01', OnnxInference(...)),
             ('kgpd_Z0', OnnxInference(...)),
             ('kgpd_C0', OnnxInference(...)),
             ('kgpd_output01', OnnxInference(...)),
             ('gpr_Y0', OnnxInference(...)),
             ('GPmean', OnnxInference(...))])

dfs = []
for k, v in split.items():
    print("node", k)
    res = benchmark_fct(lambda x: v.run({'X': x}), X_test)
    df = pandas.DataFrame(res)
    df['name'] = k
    dfs.append(df.reset_index(drop=False))

node scan0
node scan1
node kgpd_transposed0
node kgpd_Y0
node kgpd_C03
node kgpd_C02
node kgpd_output02
node kgpd_C01
node kgpd_Z0
node kgpd_C0
node kgpd_output01
node gpr_Y0
node GPmean

df = pandas.concat(dfs)
df.head()

	index	1	10	100	1000	10000	100000	name
0	average	0.000623	0.000592	0.000754	0.002202	0.017529	0.201192	scan0
1	deviation	0.000115	0.000030	0.000034	0.000026	0.000976	0.000000	scan0
2	min_exec	0.000541	0.000537	0.000657	0.002169	0.016677	0.201192	scan0
3	max_exec	0.000980	0.000639	0.000780	0.002239	0.018896	0.201192	scan0
4	repeat	20.000000	20.000000	10.000000	5.000000	3.000000	1.000000	scan0

df100c = df[df['index'] == "average"]
df100c_min = df[df['index'] == "min_exec"]
df100c_max = df[df['index'] == "max_exec"]
ave = df100c.iloc[:, 4]
ave_min = df100c_min.iloc[:, 4]
ave_max = df100c_max.iloc[:, 4]
ave.shape, ave_min.shape, ave_max.shape
index = numpy.arange(ave.shape[0])

fig, ax = plt.subplots(1, 1, figsize=(14, 4))
ax.bar(index, ave, align='center', color='orange')
ax.set_xticks(index)
ax.set_xticklabels(df100c.name)
for tick in ax.get_xticklabels():
    tick.set_rotation(20)
ax.errorbar(index, ave,
            numpy.abs((numpy.vstack([ave_min.values, ave_max.values]) - ave.values.ravel())),
            uplims=True, lolims=True)
ax.set_title("Cumulated time spent in each node for 100 "
             "observations\nGaussianProcess and onnxruntime1");

The visual graph helps matching the output names with the operator type. The curve is not monotononic because each experiment computes every output from the start. The number of repetitions should be increased. Documentation of function benchmark_fct tells how to do it.

Time processing for every ONNX nodes in a graph#

LogisticRegression#

Measure time spent in each node#

Logistic regression: python runtime vs onnxruntime#

GaussianProcessRegressor#

Measuring the time with a custom dataset#

onnxruntime2 / onnxruntime1#