Time processing for every ONNX node in a graph

The following notebook shows how long the runtime spends in each node of an ONNX graph.

LogisticRegression

Measure time spent in each node

With parameter node_time=True, method run returns both the outputs and the time measurements, one per node.
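As a minimal sketch, assuming model_onnx holds a serialized ONNX model and X is a numpy array matching its input (the exact structure of the returned measurements may vary across mlprodict versions):

    import numpy
    from mlprodict.onnxrt import OnnxInference

    oinf = OnnxInference(model_onnx)  # python runtime by default
    outputs, node_times = oinf.run({'X': X.astype(numpy.float32)},
                                   node_time=True)
    for measure in node_times:
        print(measure)  # one measurement per node (operator type, duration, ...)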

Logistic regression: python runtime vs onnxruntime

Function enumerate_validated_operator_opsets implements automated tests for every model with artificial data. Option node_time returns the time spent in each node and repeats the measurement several times.
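A hedged sketch of such a call, restricted to LogisticRegression (parameter names follow the mlprodict documentation and may differ between versions):

    from mlprodict.onnxrt.validate import enumerate_validated_operator_opsets

    # collects benchmark rows, including per-node timings when node_time=True
    rows = list(enumerate_validated_operator_opsets(
        verbose=0, models={'LogisticRegression'},
        runtime='python', node_time=True))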

The following tables show the time spent in each node, relative to the total time. For one observation, the runtime spends 10% of the time in ZipMap; it is only 1% or 2% with 10 observations. These proportions change because the computing cost of each node scales differently with the number of observations.

The python implementation of ZipMap does not change the data but wraps it into a frozen class ArrayZipMapDictionary which mocks a list of dictionaries pandas can ingest to create a DataFrame. The cost is fixed and does not depend on the number of processed rows.

The class ArrayZipMapDictionary is fast to build but adds overhead afterwards because it creates the data only when needed.

However, if all that is needed is a DataFrame with the predicted probabilities, the wrapper can be bypassed and the DataFrame built directly from the raw probability matrix, as the sketch below illustrates.
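For illustration only, with an invented probability matrix (the column labels are placeholders):

    import numpy
    import pandas

    # raw output of the probability node, before ZipMap wraps it
    proba = numpy.array([[0.1, 0.9], [0.8, 0.2]], dtype=numpy.float32)
    # direct conversion into a DataFrame, no dictionary wrapper involved
    df = pandas.DataFrame(proba, columns=[0, 1])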

We can then compare to what onnxruntime would do when the runtime is called independently for each node. We use the runtime named onnxruntime2. Class OnnxInference splits the ONNX graph into multiple ONNX graphs, one for each node, and then calls onnxruntime for each of them independently. Python handles the graph logic.
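A sketch of the same measurement with this runtime, reusing the hypothetical model_onnx and X from the earlier sketch:

    import numpy
    from mlprodict.onnxrt import OnnxInference

    # one ONNX graph per node, one onnxruntime call per node
    oinf2 = OnnxInference(model_onnx, runtime="onnxruntime2")
    outputs, node_times = oinf2.run({'X': X.astype(numpy.float32)},
                                    node_time=True)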

onnxruntime creates a new container each time a ZipMap is executed. That's why it takes so much time and why the ratio increases when the number of observations increases.

GaussianProcessRegressor

This model is slow on small batches compared to scikit-learn but closes the gap as the batch size increases. Let's see where the time goes.

The operator Scan is clearly time-consuming when the batch size is small. onnxruntime is more efficient for this one.

The results are relative. Let's see which runtime is best node by node.

Based on this, onnxruntime is faster for the operators Scan, Pow, and Exp, and slower for all the others.

Measuring the time with a custom dataset

We use the example Comparison of kernel ridge and Gaussian process regression.

runtime='onnxruntime2' tells the class OnnxInference to call onnxruntime for every node independently: there are as many onnxruntime calls as there are nodes in the graph.

The following function runs the same inference multiple times and aggregates the results node by node.
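A hypothetical helper along those lines (not taken from the notebook; it assumes each measurement is a dictionary with keys 'op_type' and 'time'):

    from collections import defaultdict

    def aggregate_node_times(oinf, inputs, repeat=10):
        # runs the same inference `repeat` times and averages
        # the time spent in each node, keyed by operator type
        totals = defaultdict(float)
        for _ in range(repeat):
            _, node_times = oinf.run(inputs, node_time=True)
            for measure in node_times:
                totals[measure['op_type']] += measure['time']
        return {op: t / repeat for op, t in totals.items()}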

onnxruntime2 / onnxruntime1

The runtime onnxruntime1 uses onnxruntime for the whole ONNX graph. There is no way to get the computation time of each node unless we create one ONNX graph for every intermediate node.
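One way to build such truncated graphs is select_model_inputs_outputs from skl2onnx; a hedged sketch, where 'inter_output' stands for the name of an intermediate result in the graph:

    import numpy
    import onnxruntime
    from skl2onnx.helpers.onnx_helper import select_model_inputs_outputs

    # keep the graph only up to one intermediate output
    sub_model = select_model_inputs_outputs(model_onnx, outputs=['inter_output'])
    sess = onnxruntime.InferenceSession(sub_model.SerializeToString(),
                                        providers=['CPUExecutionProvider'])
    res = sess.run(None, {'X': X.astype(numpy.float32)})
    # timing consecutive truncated graphs approximates per-node costs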

The visual graph helps match the output names with the operator types. The curve is not monotonic because each experiment computes every output from the start. The number of repetitions should be increased. The documentation of function benchmark_fct explains how to do it.