Fast TopK elements#


Looking for the top k elements is a key step when implementing a simple k nearest neighbors search. The implementation scikit-learn uses relies on numpy: _kneighbors_reduce_func. mlprodict also contains a C++ implementation of the same function. Let's compare them.

from jyquickhelper import add_notebook_menu
add_notebook_menu()
%matplotlib inline

Two implementations#

We assume we are looking for the indices of the k smallest elements of every row of a matrix X, a dense matrix of doubles (in nearest neighbors, a row contains the distances to the training samples).

import numpy.random as rnd
from sklearn.neighbors._base import KNeighborsMixin

mixin = KNeighborsMixin()
def topk_sklearn(X, k):
    # 0 is the start index of the chunk, False means indices only (no distances)
    return mixin._kneighbors_reduce_func(X, 0, k, False)

X = rnd.randn(5, 10)
ind = topk_sklearn(X, 3)
ind
array([[2, 7, 3],
       [7, 0, 8],
       [1, 5, 6],
       [8, 9, 3],
       [4, 6, 5]], dtype=int64)
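
Under the hood, _kneighbors_reduce_func essentially does a partial selection with numpy.argpartition followed by a local argsort of the k candidates. A rough sketch of that idea (not the exact scikit-learn code):

import numpy

def topk_numpy(X, k):
    # partial selection: indices of the k smallest values per row, in arbitrary order
    ind = numpy.argpartition(X, k - 1, axis=1)[:, :k]
    # then sort only those k candidates to get them in increasing order
    rows = numpy.arange(X.shape[0])[:, None]
    return ind[rows, numpy.argsort(X[rows, ind], axis=1)]

topk_numpy(X, 3)  # should return the same indices as above, barring ties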

Now the implementation from mlprodict (C++), available at topk_element_min. It uses a heap.

from mlprodict.onnxrt.ops_cpu._op_onnx_numpy import topk_element_min_double
def topk_cpp(X, k):
    # returns the indices of the k smallest values of every row
    return topk_element_min_double(X, k, True, 50)

ind = topk_cpp(X, 3)
ind
array([[2, 7, 3],
       [7, 0, 8],
       [1, 5, 6],
       [8, 9, 3],
       [4, 6, 5]], dtype=int64)
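
For intuition, here is a pure-Python equivalent of the heap-based selection, similar in spirit to what the C++ kernel does but much slower; it is only a sketch, not the actual mlprodict code:

import heapq
import numpy

def topk_heap(X, k):
    # for every row, keep a bounded heap of the k smallest values:
    # O(n log k) per row instead of a full O(n log n) sort
    res = numpy.empty((X.shape[0], k), dtype=numpy.int64)
    for i, row in enumerate(X):
        res[i, :] = heapq.nsmallest(k, range(row.shape[0]), key=row.__getitem__)
    return res

topk_heap(X, 3)  # same indices as topk_cpp(X, 3), barring ties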

Speed comparison by size#

%timeit topk_sklearn(X, 3)
21.7 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit topk_cpp(X, 3)
4.1 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Quite a lot faster on this small example. Let's try bigger matrices.

X = rnd.randn(1000, 100)
%timeit topk_sklearn(X, 10)
1.8 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit topk_cpp(X, 10)
786 µs ± 116 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
from cpyquickhelper.numbers import measure_time
from tqdm import tqdm
from pandas import DataFrame

rows = []
for n in tqdm(range(1000, 10001, 1000)):
    X = rnd.randn(n, 1000)
    res = measure_time('topk_sklearn(X, 20)',
                       {'X': X, 'topk_sklearn': topk_sklearn},
                       div_by_number=True,
                       number=2, repeat=2)
    res["N"] = n
    res["name"] = 'topk_sklearn'
    rows.append(res)
    res = measure_time('topk_cpp(X, 20)',
                       {'X': X, 'topk_cpp': topk_cpp},
                       div_by_number=True,
                       number=4, repeat=4)
    res["N"] = n
    res["name"] = 'topk_cpp'
    rows.append(res)

df = DataFrame(rows)
df.head()
100%|██████████| 10/10 [00:08<00:00,  1.16it/s]
    average  deviation  min_exec  max_exec  repeat  number  context_size     N          name
0  0.016310   0.000260  0.016050  0.016571       2       2           240  1000  topk_sklearn
1  0.003872   0.000501  0.003335  0.004631       4       4           240  1000      topk_cpp
2  0.034684   0.001629  0.033055  0.036313       2       2           240  2000  topk_sklearn
3  0.006973   0.000558  0.006307  0.007756       4       4           240  2000      topk_cpp
4  0.051934   0.000851  0.051084  0.052785       2       2           240  3000  topk_sklearn
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
piv = df.pivot("N", "name", "average")
piv.plot(ax=ax[0], logy=True, logx=True)
ax[0].set_title("top 20")
piv["ratio"] = piv["topk_cpp"] / piv["topk_sklearn"]
piv[["ratio"]].plot(ax=ax[1])
ax[1].set_title("ratio");
[Figure: average time of topk_sklearn and topk_cpp (log-log, "top 20") and ratio topk_cpp / topk_sklearn as a function of N]

Speed comparison by k#

rows = []
X = rnd.randn(2000, 1000)
for k in tqdm(list(range(1, 20)) + list(range(20, 1000, 20))):
    res = measure_time('topk_sklearn(X, k)',
                       {'X': X, 'topk_sklearn': topk_sklearn, 'k': k},
                       div_by_number=True,
                       number=2, repeat=2)
    res["k"] = k
    res["name"] = 'topk_sklearn'
    rows.append(res)
    res = measure_time('topk_cpp(X, k)',
                       {'X': X, 'topk_cpp': topk_cpp, 'k': k},
                       div_by_number=True,
                       number=2, repeat=2)
    res["k"] = k
    res["name"] = 'topk_cpp'
    rows.append(res)

df = DataFrame(rows)
df.head()
100%|██████████| 68/68 [00:34<00:00,  1.95it/s]
    average  deviation  min_exec  max_exec  repeat  number  context_size  k          name
0  0.009558   0.001392  0.008166  0.010949       2       2           240  1  topk_sklearn
1  0.002665   0.000571  0.002094  0.003236       2       2           240  1      topk_cpp
2  0.021933   0.000575  0.021358  0.022508       2       2           240  2  topk_sklearn
3  0.001000   0.000084  0.000917  0.001084       2       2           240  2      topk_cpp
4  0.025986   0.001411  0.024575  0.027398       2       2           240  3  topk_sklearn
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
piv = df.pivot("k", "name", "average")
piv.plot(ax=ax[0], logy=True, logx=True)
ax[0].set_title("top k for 2000 rows")
piv["ratio"] = piv["topk_cpp"] / piv["topk_sklearn"]
piv[["ratio"]].plot(ax=ax[1])
ax[1].set_title("ratio")
ax[0].set_xlabel("k")
ax[1].set_xlabel("k");
[Figure: average time of topk_sklearn and topk_cpp (log-log, "top k for 2000 rows") and ratio topk_cpp / topk_sklearn as a function of k]

The C++ implementation takes at most half the time of the numpy one in every case, and it is much more efficient for small values of k, which is the usual situation for nearest neighbors. It relies on openmp, which may explain part of the gain on this two-core machine.
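
That openmp hypothesis could be tested by rerunning the C++ benchmark with a single thread, using the standard OMP_NUM_THREADS environment variable. A sketch, assuming the kernel honours that variable (it must be set before the thread pool is created, so ideally before starting the notebook kernel):

import os
# restrict openmp to one thread; must happen before the first parallel region runs
os.environ["OMP_NUM_THREADS"] = "1"

from mlprodict.onnxrt.ops_cpu._op_onnx_numpy import topk_element_min_double
# then time again, e.g.: %timeit topk_element_min_double(X, 20, True, 50)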