Compares dot implementations (numpy, python, blas)

numpy has a very fast implementation of the dot product. It is difficult to do better and very easy to do worse. This example looks at a couple of slower implementations.

import pprint
import numpy
import matplotlib.pyplot as plt
from pandas import DataFrame, concat
from td3a_cpp.tutorial import pydot, cblas_ddot
from td3a_cpp.tools import measure_time_dim

python dot: pydot

The first function, pydot, implements the dot product in pure Python.
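As a rough illustration of what a pure Python dot product looks like, here is a minimal sketch. The name `pydot_sketch` is hypothetical, and the actual `td3a_cpp.tutorial.pydot` may be written differently; the point is only that every multiplication and addition goes through the interpreter.

```python
import numpy


def pydot_sketch(va, vb):
    """Dot product as an explicit Python loop (illustrative, not td3a_cpp's code)."""
    if len(va) != len(vb):
        raise ValueError("Vectors must have the same length.")
    total = 0.0
    # Each iteration is executed by the Python interpreter,
    # which is what makes this version slow on long vectors.
    for a, b in zip(va, vb):
        total += a * b
    return total


va = numpy.array([1.0, 2.0, 3.0])
vb = numpy.array([4.0, 5.0, 6.0])
result = pydot_sketch(va, vb)
```

The cost of such a loop grows linearly with the vector length, with a large constant factor per element.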

ctxs = [dict(va=numpy.random.randn(n).astype(numpy.float64),
             vb=numpy.random.randn(n).astype(numpy.float64),
             pydot=pydot,
             x_name=n)
        for n in range(10, 1000, 100)]

res_pydot = list(measure_time_dim('pydot(va, vb)', ctxs, verbose=1))

pprint.pprint(res_pydot[:2])

Out:

100%|##########| 10/10 [00:01<00:00,  7.43it/s]
[{'average': 1.8377067870460453e-05,
  'context_size': 232,
  'deviation': 2.7070670956464147e-07,
  'max_exec': 1.894579967483878e-05,
  'min_exec': 1.8155599827878178e-05,
  'number': 50,
  'repeat': 10,
  'x_name': 10},
 {'average': 7.386520196450874e-05,
  'context_size': 232,
  'deviation': 2.451752186823448e-07,
  'max_exec': 7.436139974743128e-05,
  'min_exec': 7.35670397989452e-05,
  'number': 50,
  'repeat': 10,
  'x_name': 110}]
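The dictionaries above report the average, deviation, minimum and maximum time per call over `repeat` runs of `number` calls each. A minimal sketch of how similar statistics could be collected with the standard `timeit` module follows. The helper `measure_sketch` is hypothetical and is not the implementation of `measure_time_dim`, which may compute or report things differently.

```python
import timeit
import numpy


def measure_sketch(stmt, context, repeat=10, number=50):
    """Time `stmt` in `context` and summarise per-call durations (hypothetical helper)."""
    # Copy the context so that timeit does not modify the caller's dictionary.
    raw = timeit.repeat(stmt, globals=dict(context), repeat=repeat, number=number)
    # Each entry of `raw` is the total time of `number` calls.
    per_call = numpy.array(raw) / number
    return dict(average=per_call.mean(), deviation=per_call.std(),
                min_exec=per_call.min(), max_exec=per_call.max(),
                repeat=repeat, number=number)


ctx = dict(va=numpy.random.randn(100), vb=numpy.random.randn(100), dot=numpy.dot)
stats = measure_sketch('dot(va, vb)', ctx)
```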

numpy dot

ctxs = [dict(va=numpy.random.randn(n).astype(numpy.float64),
             vb=numpy.random.randn(n).astype(numpy.float64),
             dot=numpy.dot,
             x_name=n)
        for n in range(10, 50000, 100)]

res_dot = list(measure_time_dim('dot(va, vb)', ctxs, verbose=1))

pprint.pprint(res_dot[:2])

Out:

100%|##########| 500/500 [00:06<00:00, 76.23it/s]
[{'average': 7.73540005320683e-06,
  'context_size': 232,
  'deviation': 1.9832059510511182e-07,
  'max_exec': 8.191520464606583e-06,
  'min_exec': 7.599919917993248e-06,
  'number': 50,
  'repeat': 10,
  'x_name': 10},
 {'average': 7.992199971340597e-06,
  'context_size': 232,
  'deviation': 1.5007598457209147e-07,
  'max_exec': 8.399920188821852e-06,
  'min_exec': 7.862320053391159e-06,
  'number': 50,
  'repeat': 10,
  'x_name': 110}]

blas dot

The numpy implementation uses BLAS. Let’s make a direct call to it.
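For readers without this module, the same BLAS routine can also be reached through SciPy. The sketch below assumes SciPy is installed and uses `scipy.linalg.blas.ddot`; it is an alternative route to BLAS, not the `cblas_ddot` wrapper benchmarked in this example.

```python
import numpy
from scipy.linalg.blas import ddot

va = numpy.random.randn(1000).astype(numpy.float64)
vb = numpy.random.randn(1000).astype(numpy.float64)

# Direct call to the double-precision BLAS dot product.
blas_result = ddot(va, vb)
# Reference value computed by numpy.
numpy_result = numpy.dot(va, vb)
```

Both values should agree up to floating-point rounding.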

for ctx in ctxs:
    ctx['ddot'] = cblas_ddot

res_ddot = list(measure_time_dim('ddot(va, vb)', ctxs, verbose=1))

pprint.pprint(res_ddot[:2])

Out:

100%|##########| 500/500 [00:07<00:00, 68.52it/s]
[{'average': 6.545930111315101e-06,
  'context_size': 360,
  'deviation': 2.4200746538248667e-07,
  'max_exec': 7.162319961935282e-06,
  'min_exec': 6.3679204322397705e-06,
  'number': 50,
  'repeat': 10,
  'x_name': 10},
 {'average': 7.008284039329738e-06,
  'context_size': 360,
  'deviation': 1.5856430064739257e-07,
  'max_exec': 7.424920331686735e-06,
  'min_exec': 6.8873399868607524e-06,
  'number': 50,
  'repeat': 10,
  'x_name': 110}]

Let’s display the results.

df1 = DataFrame(res_pydot)
df1['fct'] = 'pydot'
df2 = DataFrame(res_dot)
df2['fct'] = 'numpy.dot'
df3 = DataFrame(res_ddot)
df3['fct'] = 'ddot'

cc = concat([df1, df2, df3])
cc['N'] = cc['x_name']

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
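# Keyword arguments are required by recent pandas versions.
cc[cc.N <= 1100].pivot(index='N', columns='fct', values='average').plot(
    logy=True, logx=True, ax=ax[0])
cc[cc.fct != 'pydot'].pivot(index='N', columns='fct', values='average').plot(
    logy=True, logx=True, ax=ax[1])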
ax[0].set_title("Comparison of dot implementations")
ax[1].set_title("Comparison of dot implementations\nwithout python")
[Figure with two panels: "Comparison of dot implementations" (left) and "Comparison of dot implementations without python" (right)]

Out:

Text(0.5, 1.0, 'Comparison of dot implementations\nwithout python')

The results depend on the machine, its number of cores, and the compilation settings of numpy or of this module.

plt.show()
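Since the timings depend on the machine and on how numpy was built, it can help to record that information next to the measurements. A minimal sketch using `os.cpu_count()` and `numpy.show_config()` follows; the printed content varies with the numpy version and installation.

```python
import os
import numpy

# Number of logical cores reported by the operating system (may be None).
n_cores = os.cpu_count()
print("logical cores:", n_cores)
print("numpy version:", numpy.__version__)

# Prints the build configuration, including the BLAS/LAPACK libraries
# numpy was linked against.
numpy.show_config()
```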

Total running time of the script: ( 0 minutes 19.792 seconds)