Time processing for every ONNX nodes in a graph

Links: notebook, html, PDF, python, slides, GitHub

The following notebook show how long the runtime spends in each node of an ONNX graph.

from jyquickhelper import add_notebook_menu
add_notebook_menu()
%load_ext mlprodict
%matplotlib inline

LogisticRegression

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = LogisticRegression(solver='liblinear')
clr.fit(X_train, y_train)
LogisticRegression(solver='liblinear')
import numpy
from mlprodict.onnx_conv import to_onnx
onx = to_onnx(clr, X_test.astype(numpy.float32))
with open("logreg_time.onnx", "wb") as f:
    f.write(onx.SerializeToString())
# add -l 1 if nothing shows up
%onnxview onx
from mlprodict.onnxrt import OnnxInference
import pandas
oinf = OnnxInference(onx)
res = oinf.run({'X': X_test}, node_time=True)
pandas.DataFrame(list(res[1]))
i name op_type time
0 0 LinearClassifier LinearClassifier 0.199603
1 1 Normalizer Normalizer 0.000091
2 2 Cast Cast 0.000014
3 3 ZipMap ZipMap 0.000016
oinf.run({'X': X_test})['output_probability'][:5]
{0: array([8.38235830e-01, 1.21554664e-03, 6.97352537e-04, 7.93823160e-01,
        9.24825077e-01]),
 1: array([0.16162989, 0.39692812, 0.25688601, 0.20607722, 0.07516498]),
 2: array([1.34279470e-04, 6.01856333e-01, 7.42416637e-01, 9.96200831e-05,
        9.94208860e-06])}

Measure time spent in each node

With parameter node_time=True, method run returns the output and time measurement.

exe = oinf.run({'X': X_test}, node_time=True)
exe[1]
[{'i': 0,
  'name': 'LinearClassifier',
  'op_type': 'LinearClassifier',
  'time': 0.00015699999999974068},
 {'i': 1,
  'name': 'Normalizer',
  'op_type': 'Normalizer',
  'time': 5.43000000003957e-05},
 {'i': 2, 'name': 'Cast', 'op_type': 'Cast', 'time': 1.1699999999947863e-05},
 {'i': 3,
  'name': 'ZipMap',
  'op_type': 'ZipMap',
  'time': 1.940000000111297e-05}]
import pandas
pandas.DataFrame(exe[1])
i name op_type time
0 0 LinearClassifier LinearClassifier 0.000157
1 1 Normalizer Normalizer 0.000054
2 2 Cast Cast 0.000012
3 3 ZipMap ZipMap 0.000019

Logistic regression: python runtime vs onnxruntime

Function enumerate_validated_operator_opsets implements automated tests for every model with artificial data. Option node_time automatically returns the time spent in each node and does it multiple time.

from mlprodict.onnxrt.validate import enumerate_validated_operator_opsets
res = list(enumerate_validated_operator_opsets(
            verbose=0, models={"LogisticRegression"}, opset_min=12,
            runtime='python', debug=False, node_time=True,
            filter_exp=lambda m, p: p == "b-cl"))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
import pandas
df = pandas.DataFrame(res[0]['bench-batch'])
df['step'] = df.apply(lambda row: '{}-{}'.format(row['i'], row["name"]), axis=1)
df
i name op_type time N max_time min_time repeat number step
0 0 LinearClassifier LinearClassifier 0.000018 1 0.000033 0.000015 20 30 0-LinearClassifier
1 1 Normalizer Normalizer 0.000017 1 0.000069 0.000012 20 30 1-Normalizer
2 2 Cast Cast 0.000004 1 0.000009 0.000003 20 30 2-Cast
3 3 ZipMap ZipMap 0.000005 1 0.000007 0.000004 20 30 3-ZipMap
4 0 LinearClassifier LinearClassifier 0.000020 10 0.000052 0.000017 20 20 0-LinearClassifier
5 1 Normalizer Normalizer 0.000015 10 0.000035 0.000013 20 20 1-Normalizer
6 2 Cast Cast 0.000004 10 0.000013 0.000003 20 20 2-Cast
7 3 ZipMap ZipMap 0.000004 10 0.000008 0.000004 20 20 3-ZipMap
8 0 LinearClassifier LinearClassifier 0.000024 100 0.000036 0.000019 10 8 0-LinearClassifier
9 1 Normalizer Normalizer 0.000018 100 0.000023 0.000015 10 8 1-Normalizer
10 2 Cast Cast 0.000004 100 0.000006 0.000003 10 8 2-Cast
11 3 ZipMap ZipMap 0.000007 100 0.000005 0.000004 10 8 3-ZipMap
12 0 LinearClassifier LinearClassifier 0.000051 1000 0.000057 0.000047 5 5 0-LinearClassifier
13 1 Normalizer Normalizer 0.000041 1000 0.000045 0.000040 5 5 1-Normalizer
14 2 Cast Cast 0.000003 1000 0.000004 0.000003 5 5 2-Cast
15 3 ZipMap ZipMap 0.000004 1000 0.000004 0.000004 5 5 3-ZipMap
16 0 LinearClassifier LinearClassifier 0.000315 10000 0.000328 0.000315 3 3 0-LinearClassifier
17 1 Normalizer Normalizer 0.000272 10000 0.000284 0.000256 3 3 1-Normalizer
18 2 Cast Cast 0.000004 10000 0.000004 0.000004 3 3 2-Cast
19 3 ZipMap ZipMap 0.000004 10000 0.000004 0.000004 3 3 3-ZipMap
20 0 LinearClassifier LinearClassifier 0.005634 100000 0.005634 0.005634 1 2 0-LinearClassifier
21 1 Normalizer Normalizer 0.004671 100000 0.004671 0.004671 1 2 1-Normalizer
22 2 Cast Cast 0.000024 100000 0.000024 0.000024 1 2 2-Cast
23 3 ZipMap ZipMap 0.000013 100000 0.000013 0.000013 1 2 3-ZipMap

Following tables shows the time spent in each node, it is relative to the total time. For one observation, the runtime spends 10% of the time in ZipMap, it is only 1% or 2% with 10 observations. These proportions change due to the computing cost of each node.

piv = df.pivot('step', 'N', 'time')
total = piv.sum(axis=0)
piv / total
N 1 10 100 1000 10000 100000
step
0-LinearClassifier 0.410138 0.459103 0.450882 0.512622 0.530490 0.544785
1-Normalizer 0.390060 0.353622 0.350126 0.414227 0.456671 0.451642
2-Cast 0.095729 0.089857 0.074343 0.034398 0.006092 0.002306
3-ZipMap 0.104073 0.097418 0.124649 0.038753 0.006747 0.001267

The python implementation of ZipMap does not change the data but wraps in into a frozen class ArrayZipMapDitionary which mocks a list of dictionaries pandas can ingest to create a DataFrame. The cost is a fixed cost and does not depend on the number of processed rows.

from pyquickhelper.pycode.profiling import profile
bigX = numpy.random.randn(100000, X_test.shape[1]).astype(numpy.float32)
print(profile(lambda: oinf.run({'X': bigX}), pyinst_format="text")[1])
  _     ._   __/__   _ _  _  _ _/_   Recorded: 00:28:08  Samples:  4
 /_//_/// /_/ //_// / //_'/ //     Duration: 0.009     CPU time: 0.031
/   _/                      v3.0.1
Program: c:python372_x64libsite-packagesipykernel_launcher.py -f C:UsersxavieAppDataRoamingjupyterruntimekernel-287476aa-b8ba-4140-902a-b0aad833ffd0.json
0.008 profile  pyquickhelperpycodeprofiling.py:49
`- 0.008 <lambda>  <ipython-input-13-ccd42692a7ed>:3
   `- 0.008 run  mlprodictonnxrtonnx_inference.py:475
      `- 0.008 _run_sequence_runtime  mlprodictonnxrtonnx_inference.py:558
         `- 0.008 run  mlprodictonnxrtonnx_inference_node.py:141
            |- 0.005 run  mlprodictonnxrtops_cpu_op.py:417
            |  `- 0.005 run  mlprodictonnxrtops_cpu_op.py:298
            |     `- 0.005 _run  mlprodictonnxrtops_cpuop_linear_classifier.py:40
            |        |- 0.003 [self]
            |        `- 0.002 argmax  <__array_function__ internals>:2
            |           `- 0.002 argmax  numpycorefromnumeric.py:1112
            |                 [3 frames hidden]  numpy
            |                    0.002 _wrapfunc  numpycorefromnumeric.py:55
            `- 0.003 run  mlprodictonnxrtops_cpu_op.py:383
               `- 0.003 run  mlprodictonnxrtops_cpu_op.py:298
                  `- 0.003 _run  mlprodictonnxrtops_cpuop_normalizer.py:66
                     `- 0.003 norm_l1  mlprodictonnxrtops_cpuop_normalizer.py:42
                        `- 0.003 _norm_L1_inplace  mlprodictonnxrtops_cpuop_normalizer.py:49
                           |- 0.002 [self]
                           `- 0.002 _sum  numpycore_methods.py:36
                                 [2 frames hidden]  numpy

The class ArrayZipMapDictionary is fast to build but has an overhead after that because it builds data when needed.

res = oinf.run({'X': bigX})
prob = res['output_probability']
type(prob)
mlprodict.onnxrt.ops_cpu.op_zipmap.ArrayZipMapDictionary
%timeit pandas.DataFrame(prob)
721 ms ± 54.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
list_of_dict = [v.asdict() for v in prob]
%timeit pandas.DataFrame(list_of_dict)
108 ms ± 2.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

But if you just need to do the following:

%timeit pandas.DataFrame(prob).values
713 ms ± 56.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Then, you can just do that:

print(prob.columns)
%timeit prob.values
[0, 1, 2]
390 ns ± 51.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

And then:

%timeit -n 100 pandas.DataFrame(prob.values, columns=prob.columns)
215 µs ± 82.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can then compare to what onnxruntime would do when the runtime is called indenpently for each node. We use the runtime named onnxruntime2. Class OnnxInference splits the ONNX graph into multiple ONNX graphs, one for each node, and then calls onnxruntime for each of them indenpently. Python handles the graph logic.

res = list(enumerate_validated_operator_opsets(
            verbose=0, models={"LogisticRegression"}, opset_min=12,
            runtime='onnxruntime2', debug=False, node_time=True))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
C:xavierdupre__home_github_forkscikit-learnsklearnlinear_model_logistic.py:1356: UserWarning: 'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 4.
  " = {}.".format(effective_n_jobs(self.n_jobs)))
res0 = None
for i, r in enumerate(res):
    if "available-ERROR" in r:
        print(i, str(r['available-ERROR']).split("\n")[0])
    elif res0 is None:
        res0 = r
0 Unable to load node 'ZipMap' (output type was inferred)
1 Unable to load node 'ZipMap' (output type was inferred)
4 Unable to load node 'LinearClassifier' (output type was guessed)
5 Unable to load node 'LinearClassifier' (output type was guessed)
6 Unable to load node 'LinearClassifier' (output type was guessed)
7 Unable to load node 'LinearClassifier' (output type was guessed)
8 Unable to load node 'ZipMap' (output type was inferred)
9 Unable to load node 'ZipMap' (output type was inferred)
if '_6ort_run_batch_exc' in res[0]:
    m = "Something went wrong.", res[0]['_6ort_run_batch_exc']
else:
    df = pandas.DataFrame(res0['bench-batch'])
    print(df)
    df['step'] = df.apply(lambda row: '{}-{}'.format(row['i'], row["name"]), axis=1)
    piv = df.pivot('step', 'N', 'time')
    total = piv.sum(axis=0)
    m = piv / total
m
   i              name           op_type      time       N  max_time  0  0  LinearClassifier  LinearClassifier  0.000052       1  0.000190
1  0  LinearClassifier  LinearClassifier  0.000044      10  0.000070
2  0  LinearClassifier  LinearClassifier  0.000071     100  0.000133
3  0  LinearClassifier  LinearClassifier  0.000066    1000  0.000079
4  0  LinearClassifier  LinearClassifier  0.000409   10000  0.000408
5  0  LinearClassifier  LinearClassifier  0.003275  100000  0.003275
   min_time  repeat  number
0  0.000028      20      30
1  0.000031      20      20
2  0.000046      10       8
3  0.000057       5       5
4  0.000365       3       3
5  0.003275       1       2
N 1 10 100 1000 10000 100000
step
0-LinearClassifier 1.0 1.0 1.0 1.0 1.0 1.0

onnxruntime creates a new container each time a ZipMap is executed. That’s whay it takes that much time and the ratio increases when the number of observations increases.

GaussianProcessRegressor

This operator is slow for small batches compare to scikit-learn but closes the gap as the batch size increases. Let’s see where the time goes.

from onnx.defs import onnx_opset_version
from mlprodict.tools.asv_options_helper import get_opset_number_from_onnx
onnx_opset_version(), get_opset_number_from_onnx()
(12, 12)
res = list(enumerate_validated_operator_opsets(
            verbose=1, models={"GaussianProcessRegressor"},
            opset_min=get_opset_number_from_onnx(),
            opset_max=get_opset_number_from_onnx(),
            runtime='python', debug=False, node_time=True,
            filter_exp=lambda m, p: p == "b-reg"))
[enumerate_validated_operator_opsets] opset in [12, 12].
GaussianProcessRegressor    :   0%|          | 0/1 [00:00<?, ?it/s]
[enumerate_compatible_opset] opset in [12, 12].
GaussianProcessRegressor    : 100%|██████████| 1/1 [00:05<00:00,  5.66s/it]
res0 = None
for i, r in enumerate(res):
    if "available-ERROR" in r:
        print(i, str(r['available-ERROR']).split("\n")[0])
    elif res0 is None:
        res0 = r
df = pandas.DataFrame(res0['bench-batch'])
df['step'] = df.apply(lambda row: '{0:02d}-{1}'.format(row['i'], row["name"]), axis=1)
df.head()
i name op_type time N max_time min_time repeat number step
0 0 kgpd_CDist CDist 0.000033 1 0.000045 0.000027 20 30 00-kgpd_CDist
1 1 kgpd_Div Div 0.000009 1 0.000016 0.000007 20 30 01-kgpd_Div
2 2 kgpd_Mul Mul 0.000006 1 0.000007 0.000005 20 30 02-kgpd_Mul
3 3 kgpd_Sin Sin 0.000007 1 0.000009 0.000006 20 30 03-kgpd_Sin
4 4 kgpd_Div1 Div 0.000007 1 0.000008 0.000005 20 30 04-kgpd_Div1
pivpy = df.pivot('step', 'N', 'time')
total = pivpy.sum(axis=0)
pivpy / total
N 1 10 100 1000 10000 100000
step
00-kgpd_CDist 0.311496 0.300665 0.244035 0.227984 0.264447 0.288546
01-kgpd_Div 0.082535 0.067193 0.028348 0.011667 0.012230 0.015447
02-kgpd_Mul 0.059840 0.050664 0.018670 0.006959 0.010950 0.012468
03-kgpd_Sin 0.067037 0.086529 0.106165 0.113068 0.102102 0.107563
04-kgpd_Div1 0.061852 0.053088 0.025749 0.010935 0.009810 0.009875
05-kgpd_Pow 0.072520 0.166539 0.361318 0.438253 0.418169 0.404182
06-kgpd_Mul1 0.057508 0.050477 0.020386 0.010334 0.009079 0.010466
07-kgpd_Exp 0.067885 0.098850 0.150876 0.168079 0.165177 0.145106
08-gpr_MatMul 0.137546 0.064570 0.025069 0.009029 0.007134 0.006159
09-gpr_Add 0.081782 0.061424 0.019383 0.003692 0.000903 0.000190
ax = (pivpy / total).T.plot(logx=True, figsize=(14, 4))
ax.set_ylim([0,1])
ax.set_title("Time spent in each node relatively to the total time\npython runtime");
../_images/onnx_node_time_40_0.png

The operator Scan is clearly time consuming when the batch size is small. onnxruntime is more efficient for this one.

res = list(enumerate_validated_operator_opsets(
            verbose=1, models={"GaussianProcessRegressor"},
            opset_min=get_opset_number_from_onnx(),
            opset_max=get_opset_number_from_onnx(),
            runtime='onnxruntime2', debug=False, node_time=True,
            filter_exp=lambda m, p: p == "b-reg"))
[enumerate_validated_operator_opsets] opset in [12, 12].
GaussianProcessRegressor    :   0%|          | 0/1 [00:00<?, ?it/s]
[enumerate_compatible_opset] opset in [12, 12].
GaussianProcessRegressor    : 100%|██████████| 1/1 [00:06<00:00,  6.84s/it]
try:
    df = pandas.DataFrame(res[0]['bench-batch'])
except KeyError as e:
    print("No model available.")
    r, df = None, None
if df is not None:
    df['step'] = df.apply(lambda row: '{0:02d}-{1}'.format(row['i'], row["name"]), axis=1)
    pivort = df.pivot('step', 'N', 'time')
    total = pivort.sum(axis=0)
    r = pivort / total
r
N 1 10 100 1000 10000 100000
step
00-kgpd_CDist 0.114001 0.120884 0.128845 0.148042 0.156776 0.180377
01-kgpd_Div 0.101792 0.098622 0.085983 0.085108 0.086603 0.084520
02-kgpd_Mul 0.099980 0.097547 0.084001 0.064706 0.072849 0.081023
03-kgpd_Sin 0.089632 0.103505 0.194194 0.301002 0.245769 0.260717
04-kgpd_Div1 0.099119 0.096737 0.088709 0.063237 0.095840 0.091635
05-kgpd_Pow 0.108045 0.098307 0.081161 0.064898 0.079015 0.076962
06-kgpd_Mul1 0.098561 0.098475 0.082770 0.063557 0.087732 0.076762
07-kgpd_Exp 0.090019 0.087015 0.086282 0.088542 0.103798 0.087690
08-gpr_MatMul 0.100426 0.102766 0.106220 0.102751 0.069157 0.059617
09-gpr_Add 0.098425 0.096143 0.061836 0.018155 0.002462 0.000696
if r is not None:
    ax = (pivort / total).T.plot(logx=True, figsize=(14, 4))
    ax.set_ylim([0,1])
    ax.set_title("Time spent in each node relatively to the total time\nonnxtunime");
../_images/onnx_node_time_44_0.png

The results are relative. Let’s see which runtime is best node by node.

if r is not None:
    r = (pivort - pivpy) / pivpy
r
N 1 10 100 1000 10000 100000
step
00-kgpd_CDist -0.239113 -0.367743 -0.630420 -0.703226 -0.677041 -0.631775
01-kgpd_Div 1.564119 1.308106 1.123155 2.333824 2.857590 2.223117
02-kgpd_Mul 2.473648 2.027773 2.149388 3.249448 2.624165 2.828025
03-kgpd_Sin 1.779780 0.881090 0.280401 0.216680 0.311288 0.427768
04-kgpd_Div1 2.331710 1.865534 1.411570 1.642922 4.321977 4.466249
05-kgpd_Pow 2.097502 -0.071724 -0.842765 -0.932321 -0.897065 -0.887837
06-kgpd_Mul1 2.563218 2.067909 1.842112 1.811019 4.263897 3.320524
07-kgpd_Exp 1.756911 0.384288 -0.599693 -0.759241 -0.657668 -0.644029
08-gpr_MatMul 0.517953 1.502798 1.965953 4.201281 4.281286 4.701255
09-gpr_Add 1.502111 1.461452 1.233097 1.247736 0.486358 1.160395

Based on this, onnxruntime is faster for operators Scan, Pow, Exp and slower for all the others.

Measuring the time with a custom dataset

We use the example Comparison of kernel ridge and Gaussian process regression.

import numpy
import pandas
import matplotlib.pyplot as plt
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, ExpSineSquared

rng = numpy.random.RandomState(0)

# Generate sample data
X = 15 * rng.rand(100, 1)
y = numpy.sin(X).ravel()
y += 3 * (0.5 - rng.rand(X.shape[0]))  # add noise

gp_kernel = ExpSineSquared(1.0, 5.0, periodicity_bounds=(1e-2, 1e1))
gpr = GaussianProcessRegressor(kernel=gp_kernel)
gpr.fit(X, y)
C:xavierdupre__home_github_forkscikit-learnsklearngaussian_processkernels.py:409: ConvergenceWarning: The optimal value found for dimension 0 of parameter length_scale is close to the specified lower bound 1e-05. Decreasing the bound and calling fit again may find a better value.
  ConvergenceWarning)
C:xavierdupre__home_github_forkscikit-learnsklearngaussian_processkernels.py:418: ConvergenceWarning: The optimal value found for dimension 0 of parameter periodicity is close to the specified upper bound 10.0. Increasing the bound and calling fit again may find a better value.
  ConvergenceWarning)
GaussianProcessRegressor(kernel=ExpSineSquared(length_scale=1, periodicity=5))
onx = to_onnx(gpr, X_test.astype(numpy.float64))
with open("gpr_time.onnx", "wb") as f:
    f.write(onx.SerializeToString())
%onnxview onx -r 1
from mlprodict.tools import get_ir_version_from_onnx
onx.ir_version = get_ir_version_from_onnx()
oinfpy = OnnxInference(onx, runtime="python")
oinfort = OnnxInference(onx, runtime="onnxruntime2")

runtime==onnxruntime2 tells the class OnnxInference to use onnxruntime for every node independently, there are as many calls as there are nodes in the graph.

respy = oinfpy.run({'X': X_test}, node_time=True)
try:
    resort = oinfort.run({'X': X_test}, node_time=True)
except Exception as e:
    print(e)
    resort = None
if resort is not None:
    df = pandas.DataFrame(respy[1]).merge(pandas.DataFrame(resort[1]), on=["i", "name", "op_type"],
                                        suffixes=("_py", "_ort"))
    df['delta'] = df.time_ort - df.time_py
else:
    df = None
df
i name op_type time_py time_ort delta
0 0 Sc_Scan Scan 0.007998 0.005970 -0.002028
1 1 kgpd_Transpose Transpose 0.000032 0.000599 0.000567
2 2 kgpd_Sqrt Sqrt 0.000063 0.000112 0.000049
3 3 kgpd_Div Div 0.000143 0.000097 -0.000045
4 4 kgpd_Mul Mul 0.000038 0.000321 0.000283
5 5 kgpd_Sin Sin 0.000095 0.000146 0.000051
6 6 kgpd_Div1 Div 0.000027 0.000096 0.000069
7 7 kgpd_Pow Pow 0.000299 0.000104 -0.000196
8 8 kgpd_Mul1 Mul 0.000032 0.000097 0.000065
9 9 kgpd_Exp Exp 0.000383 0.000111 -0.000271
10 10 gpr_MatMul MatMul 0.000080 0.004359 0.004279
11 11 gpr_Add Add 0.000034 0.000165 0.000131

The following function runs multiple the same inference and aggregates the results node by node.

from mlprodict.onnxrt.validate.validate import benchmark_fct
res = benchmark_fct(lambda X: oinfpy.run({'X': X_test}, node_time=True),
                    X_test, node_time=True)
df = pandas.DataFrame(res)
df[df.N == 100]
i name op_type time N max_time min_time repeat number
24 0 Sc_Scan Scan 0.004154 100 0.004330 0.003843 10 8
25 1 kgpd_Transpose Transpose 0.000013 100 0.000019 0.000010 10 8
26 2 kgpd_Sqrt Sqrt 0.000018 100 0.000022 0.000015 10 8
27 3 kgpd_Div Div 0.000025 100 0.000092 0.000015 10 8
28 4 kgpd_Mul Mul 0.000012 100 0.000019 0.000009 10 8
29 5 kgpd_Sin Sin 0.000057 100 0.000070 0.000050 10 8
30 6 kgpd_Div1 Div 0.000014 100 0.000017 0.000011 10 8
31 7 kgpd_Pow Pow 0.000172 100 0.000198 0.000155 10 8
32 8 kgpd_Mul1 Mul 0.000020 100 0.000101 0.000009 10 8
33 9 kgpd_Exp Exp 0.000213 100 0.000249 0.000193 10 8
34 10 gpr_MatMul MatMul 0.000034 100 0.000047 0.000026 10 8
35 11 gpr_Add Add 0.000013 100 0.000019 0.000011 10 8
df100 = df[df.N == 100]
%matplotlib inline
fig, ax = plt.subplots(1, 1, figsize=(14, 4))
ax.bar(df100.i, df100.time, align='center', color='orange')
ax.set_xticks(df100.i)
ax.set_yscale('log')
ax.set_xticklabels(df100.op_type)
ax.errorbar(df100.i, df100.time,
            numpy.abs(df100[["min_time", "max_time"]].T.values - df100.time.values.ravel()),
            uplims=True, lolims=True, color='blue')
ax.set_title("Time spent in each node for 100 observations\nGaussianProcess");
../_images/onnx_node_time_61_0.png
df100c = df100.cumsum()
fig, ax = plt.subplots(1, 1, figsize=(14, 4))
ax.bar(df100.i, df100c.time, align='center', color='orange')
ax.set_xticks(df100.i)
#ax.set_yscale('log')
ax.set_ylim([df100c.min_time.min(), df100c.max_time.max()])
ax.set_xticklabels(df100.op_type)
ax.errorbar(df100.i, df100c.time,
            numpy.abs((df100c[["min_time", "max_time"]].T.values - df100c.time.values.ravel())),
            uplims=True, lolims=True)
ax.set_title("Cumulated time spent in each node for 100 observations\nGaussianProcess");
../_images/onnx_node_time_63_0.png

onnxruntime2 / onnxruntime1

The runtime onnxruntime1 uses onnxruntime for the whole ONNX graph. There is no way to get the computation time for each node except if we create a ONNX graph for each intermediate node.

oinfort1 = OnnxInference(onx, runtime='onnxruntime1')
split = oinfort1.build_intermediate()
split
OrderedDict([('scan0', OnnxInference(...)),
             ('scan1', OnnxInference(...)),
             ('kgpd_transposed0', OnnxInference(...)),
             ('kgpd_Y0', OnnxInference(...)),
             ('kgpd_C03', OnnxInference(...)),
             ('kgpd_C02', OnnxInference(...)),
             ('kgpd_output02', OnnxInference(...)),
             ('kgpd_C01', OnnxInference(...)),
             ('kgpd_Z0', OnnxInference(...)),
             ('kgpd_C0', OnnxInference(...)),
             ('kgpd_output01', OnnxInference(...)),
             ('gpr_Y0', OnnxInference(...)),
             ('GPmean', OnnxInference(...))])
dfs = []
for k, v in split.items():
    print("node", k)
    res = benchmark_fct(lambda x: v.run({'X': x}), X_test)
    df = pandas.DataFrame(res)
    df['name'] = k
    dfs.append(df.reset_index(drop=False))
node scan0
node scan1
node kgpd_transposed0
node kgpd_Y0
node kgpd_C03
node kgpd_C02
node kgpd_output02
node kgpd_C01
node kgpd_Z0
node kgpd_C0
node kgpd_output01
node gpr_Y0
node GPmean
df = pandas.concat(dfs)
df.head()
index 1 10 100 1000 10000 100000 name
0 average 0.000623 0.000592 0.000754 0.002202 0.017529 0.201192 scan0
1 deviation 0.000115 0.000030 0.000034 0.000026 0.000976 0.000000 scan0
2 min_exec 0.000541 0.000537 0.000657 0.002169 0.016677 0.201192 scan0
3 max_exec 0.000980 0.000639 0.000780 0.002239 0.018896 0.201192 scan0
4 repeat 20.000000 20.000000 10.000000 5.000000 3.000000 1.000000 scan0
df100c = df[df['index'] == "average"]
df100c_min = df[df['index'] == "min_exec"]
df100c_max = df[df['index'] == "max_exec"]
ave = df100c.iloc[:, 4]
ave_min = df100c_min.iloc[:, 4]
ave_max = df100c_max.iloc[:, 4]
ave.shape, ave_min.shape, ave_max.shape
index = numpy.arange(ave.shape[0])
fig, ax = plt.subplots(1, 1, figsize=(14, 4))
ax.bar(index, ave, align='center', color='orange')
ax.set_xticks(index)
ax.set_xticklabels(df100c.name)
for tick in ax.get_xticklabels():
    tick.set_rotation(20)
ax.errorbar(index, ave,
            numpy.abs((numpy.vstack([ave_min.values, ave_max.values]) - ave.values.ravel())),
            uplims=True, lolims=True)
ax.set_title("Cumulated time spent in each node for 100 "
             "observations\nGaussianProcess and onnxruntime1");
../_images/onnx_node_time_70_0.png

The visual graph helps matching the output names with the operator type. The curve is not monotononic because each experiment computes every output from the start. The number of repetitions should be increased. Documentation of function benchmark_fct tells how to do it.