Pairwise distances with ONNX (pdist)

The function pdist computes pairwise distances between observations in n-dimensional space. Converting it into ONNX is not that difficult when the dimension of the input is always the same. What if it is not?

Function pdist

The function pdist computes pairwise distances. Given a list of vectors $(X_1, \dots, X_n)$, pdist returns the matrix $D=(d_{ij})$ where $d_{ij}=\operatorname{dist}(X_i, X_j)=\lVert X_i - X_j \rVert^2$.
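For instance, with scipy, pdist returns the $\frac{n(n-1)}{2}$ distances as a condensed vector and squareform expands it into the symmetric matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

x = np.array([[0., 1.], [1., 2.], [2., 2.]])
# pdist returns the n(n-1)/2 pairwise distances as a condensed vector;
# squareform expands it into the symmetric matrix D.
D = squareform(pdist(x, metric='sqeuclidean'))
```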

The following two functions are implemented to reduce the number of allocations the algorithm requires.
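They are not reproduced here; below is a minimal sketch of what they might look like (the names custom_pdist and custom_pdist_buffered are hypothetical), where the second variant writes into preallocated buffers instead of allocating at every iteration:

```python
import numpy as np

def custom_pdist(x):
    # Naive version: allocates a fresh array at every iteration.
    n = x.shape[0]
    res = np.empty((n, n), dtype=x.dtype)
    for i in range(n):
        diff = x - x[i]
        res[i] = np.sum(diff * diff, axis=1)
    return res

def custom_pdist_buffered(x):
    # Reuses preallocated buffers to avoid per-iteration allocations.
    n = x.shape[0]
    res = np.empty((n, n), dtype=x.dtype)
    buf = np.empty_like(x)
    for i in range(n):
        np.subtract(x, x[i], out=buf)    # broadcasted difference
        np.multiply(buf, buf, out=buf)   # squared in place
        np.sum(buf, axis=1, out=res[i])  # row i of the distance matrix
    return res
```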

Both functions compute $n^2$ distances whereas only $\frac{n(n-1)}{2}$ are necessary since the final matrix is symmetric. Let's change the implementation to reflect that.
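A sketch of such a lower-triangle version (the name custom_pdist_lower is hypothetical):

```python
import numpy as np

def custom_pdist_lower(x):
    # Computes only the strict lower triangle: n(n-1)/2 distances.
    n = x.shape[0]
    res = np.zeros((n, n), dtype=x.dtype)
    for i in range(1, n):
        diff = x[:i] - x[i]              # rows 0..i-1 minus row i
        d = np.sum(diff * diff, axis=1)
        res[i, :i] = d                   # fill the lower triangle
        res[:i, i] = d                   # mirror into the upper triangle
    return res
```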

Loop mechanism in ONNX

The operator Loop seems appropriate but it is just a loop, whereas Scan holds an accumulator. The first graph is the one repeated inside the loop.

The operator Scan repeats this graph several times: sum_in is the accumulator, next is the row currently iterated from the input matrix.

All together in the same graph.
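Following the cumulative-sum example from the ONNX specification, a minimal sketch of such a Scan graph built with onnx.helper:

```python
from onnx import helper, TensorProto

# Body graph executed at every iteration: 'sum_in' is the accumulator,
# 'next' is the current row taken from the scanned input.
sum_in = helper.make_tensor_value_info('sum_in', TensorProto.FLOAT, [2])
next_in = helper.make_tensor_value_info('next', TensorProto.FLOAT, [2])
sum_out = helper.make_tensor_value_info('sum_out', TensorProto.FLOAT, [2])
scan_out = helper.make_tensor_value_info('scan_out', TensorProto.FLOAT, [2])

body = helper.make_graph(
    [helper.make_node('Add', ['sum_in', 'next'], ['sum_out']),
     helper.make_node('Identity', ['sum_out'], ['scan_out'])],
    'body', [sum_in, next_in], [sum_out, scan_out])

# One state variable ('initial') and one scanned input ('x'); the outputs
# are the final state and the per-iteration results stacked together.
node = helper.make_node('Scan', ['initial', 'x'], ['final', 'cumsum'],
                        num_scan_inputs=1, body=body)
```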

Back to pdist

sklearn-onnx implements the function pdist with ONNX operators. The parameter inputs=[('x', FloatTensorType())] tells the method to_onnx that the dimensions of the input are not fixed and should not be checked.
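A sketch of the conversion, assuming skl2onnx's algebra API where onnx_squareform_pdist (in skl2onnx.algebra.complex_functions) builds the Scan subgraph; the exact signature may differ across versions:

```python
import numpy as np
from skl2onnx.algebra.onnx_ops import OnnxIdentity
from skl2onnx.algebra.complex_functions import onnx_squareform_pdist
from skl2onnx.common.data_types import FloatTensorType

# onnx_squareform_pdist builds the Scan subgraph computing pdist;
# the signature below is assumed and may vary across versions.
pd = onnx_squareform_pdist('x', dtype=np.float32)
final = OnnxIdentity(pd, output_names=['Y'])
# FloatTensorType() without a shape leaves all dimensions unspecified.
onx = final.to_onnx(inputs=[('x', FloatTensorType())])
```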

Notice the double arrow: input x is used twice, once as a permanent state involved in the broadcasted subtraction, and once to iterate over rows. On the output side, the first output of operator Scan is a permanent state equal to the input, and the second is the aggregation of the results produced at each iteration. Each iteration produces one row of the final matrix.
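In numpy terms, the Scan subgraph behaves like the following sketch (the names mirror the graph, not an actual API):

```python
import numpy as np

def squareform_pdist_reference(x):
    # State: the full matrix x; scanned input: its rows, one per iteration.
    rows = []
    for row in x:                                 # 'next' in the Scan body
        diff = x - row                            # broadcasted Sub on the state
        rows.append(np.sum(diff * diff, axis=1))  # ReduceSumSquare
    # First Scan output: the unchanged state; second: the stacked rows.
    return x, np.vstack(rows)
```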

All together.

Let's now execute the graph and compare its output with the original implementation.
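A sketch of that comparison, assuming the model onx built above and onnxruntime as the runtime:

```python
import numpy as np
import onnxruntime as rt
from scipy.spatial.distance import pdist, squareform

sess = rt.InferenceSession(onx.SerializeToString(),
                           providers=['CPUExecutionProvider'])
x = np.random.randn(5, 3).astype(np.float32)
got = sess.run(None, {'x': x})[0]
expected = squareform(pdist(x, metric='sqeuclidean'))
assert np.allclose(got, expected, atol=1e-5)
```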

Benchmark

The curves are not linear and rather difficult to interpret. The algorithms numpy-lower and scipy should be close as the costs of both algorithms are similar. However, scipy reduces the number of round trips between C and Python. The C implementation of the distance is here: sqeuclidean_distance_double. The final cost is a combination of computation, multithreading, allocations...
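A sketch of how such measurements can be collected with timeit, assuming the implementations defined above:

```python
import timeit
import numpy as np
from scipy.spatial.distance import pdist, squareform

for n in [10, 100, 1000]:
    x = np.random.randn(n, 10)
    t_scipy = timeit.timeit(lambda: squareform(pdist(x, 'sqeuclidean')),
                            number=10)
    t_numpy = timeit.timeit(lambda: custom_pdist_lower(x), number=10)
    print(n, t_scipy, t_numpy)
```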

Universal functions do not seem to be very efficient in our case. The last graph shows the time ratio between each implementation of pdist and the baseline scipy.

Test with a new operator CDist

The final question is: should we introduce a new operator into the ONNX specifications? The function pdist is not necessarily used on a big number of observations, as the square matrix it produces would be even bigger. It seems reasonable. We showed that a Python runtime based on numpy would not help; the implementation must be done in C++ or directly reuse the scipy version. The experiment was done with a GaussianProcessRegressor. The following section tests with and without a new operator CDist reusing the scipy implementation.
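A sketch of what the change amounts to, assuming a runtime that registers a custom CDist operator in a dedicated domain (onnxruntime exposes one in its com.microsoft contrib domain); the whole Scan subgraph collapses into a single node:

```python
from onnx import helper

# Hypothetical replacement: the whole Scan subgraph becomes one node.
# The 'com.microsoft' domain and the 'metric' attribute follow
# onnxruntime's contrib CDist; other runtimes may use a different domain.
node = helper.make_node('CDist', ['x', 'x'], ['Y'],
                        domain='com.microsoft', metric='sqeuclidean')
```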

It is 10 times faster for this dataset, so it is worth it. For bigger datasets, we should expect a lower but still significant gain.