2019-12-04 RandomForestClassifier - prediction for one observation

I was meeting with Olivier Grisel this morning and we were wondering why scikit-learn was slow to compute the prediction of a random forest for one observation compared to what onnxruntime does, and more specifically to some optimized C++ code inspired by onnxruntime. We used py-spy and wrote the following script:

import numpy as np
from joblib import Memory
import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import to_onnx
from mlprodict.onnxrt import OnnxInference

# cache the trained model and the serialized ONNX graph on disk
m = Memory(location="c:\\temp", mmap_mode='r')

@m.cache
def make_model():
    # build a small classification problem and train a random forest
    X, y = make_classification(
        n_features=2, n_redundant=0, n_informative=2,
        random_state=1, n_clusters_per_class=1, n_samples=1000)
    rf = RandomForestClassifier(
        max_depth=5, n_estimators=100, max_features=1)
    rf.fit(X, y)
    # convert the fitted model to ONNX and serialize it as bytes
    onx = to_onnx(rf, X.astype(np.float32))
    onxb = onx.SerializeToString()
    return rf, X, onxb

rf, X, onxb = make_model()
X = X.astype(np.float32)
oinf = OnnxInference(onxb, runtime="python")  # mlprodict runtime
oinf_ort = OnnxInference(onxb, runtime="onnxruntime1")  # onnxruntime (not used below)

def f1():
    # ONNX runtime: predict one observation at a time, 10 passes
    for _ in range(10):
        for i in range(X.shape[0]):
            y = oinf.run({'X': X[i: i+1]})

def f2():
    # scikit-learn: same loops, with finiteness checks disabled
    with sklearn.config_context(assume_finite=True):
        for _ in range(10):
            for i in range(X.shape[0]):
                rf.predict(X[i: i+1])

f1()  # C++ code
f2()  # scikit-learn
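
Before profiling, a quick timing of the two functions already gives an order of magnitude for the gap. This is a minimal sketch, assuming the definitions above are available; the per-prediction figure divides by the 10,000 single predictions each function performs:

import time

for f in [f1, f2]:
    begin = time.perf_counter()
    f()
    duration = time.perf_counter() - begin
    # 10 passes over 1000 observations = 10000 single predictions
    print("%s: %1.3f ms per prediction" % (f.__name__, duration / 10000 * 1000))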

The script is run with the following command line:

py-spy record --native --function --rate=10 -o demo.svg -- python demo.py

The following image is a snapshot of the resulting flame graph:

[figure rf1.png: py-spy flame graph of the single-observation benchmark]

It shows that on my machine, scikit-learn is slowed down by joblib, whose parallel dispatching is not really useful for such a small dataset. check_is_fitted also takes some time even though sklearn.config_context(assume_finite=True) was used: that context only skips the finiteness checks, not the fitted-attribute check. We then modified the script, shown after the sketch below, to compare this outcome to what we would get when predicting 1000 observations in a row.
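
To isolate the cost of that validation step, check_is_fitted can be timed on its own. This is a minimal sketch, assuming the rf model from the script above; on scikit-learn versions older than 0.22, the attribute name has to be passed explicitly, e.g. check_is_fitted(rf, 'estimators_'):

import timeit
from sklearn.utils.validation import check_is_fitted

# per-call cost of the fitted-attribute check alone
n = 10000
t = timeit.timeit(lambda: check_is_fitted(rf), number=n)
print("check_is_fitted: %1.1f microseconds per call" % (t / n * 1e6))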

import numpy as np
from joblib import Memory
import sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import to_onnx
from mlprodict.onnxrt import OnnxInference

m = Memory(location="c:\\temp", mmap_mode='r')

@m.cache
def make_model():
    # build a small classification problem and train a random forest
    X, y = make_classification(
        n_features=2, n_redundant=0, n_informative=2,
        random_state=1, n_clusters_per_class=1, n_samples=1000)
    rf = RandomForestClassifier(
        max_depth=5, n_estimators=100, max_features=1)
    rf.fit(X, y)
    # convert the fitted model to ONNX and serialize it as bytes
    onx = to_onnx(rf, X.astype(np.float32))
    onxb = onx.SerializeToString()
    return rf, X, onxb

rf, X, onxb = make_model()
X = X.astype(np.float32)
oinf = OnnxInference(onxb, runtime="python")

def f1_1000():
    # ONNX runtime: predict the full 1000-observation batch, 5000 times
    for _ in range(5000):
        y = oinf.run({'X': X})

def f2_1000():
    # scikit-learn: same batched prediction
    with sklearn.config_context(assume_finite=True):
        for _ in range(5000):
            rf.predict(X)

f1_1000()  # C++
f2_1000()  # scikit-learn
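
As before, a quick timing of the batched functions gives the order of magnitude. This is a minimal sketch, assuming the definitions above:

import time

for f in [f1_1000, f2_1000]:
    begin = time.perf_counter()
    f()
    duration = time.perf_counter() - begin
    # 5000 batches of 1000 observations each
    print("%s: %1.3f ms per batch" % (f.__name__, duration / 5000 * 1000))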

Both versions spend a similar amount of time in the functions that compute the predictions, but joblib still adds some overhead.

[figure rf123.png: py-spy flame graph of the batched benchmark]

Figures for other classifiers can be found at Prediction time scikit-learn / onnxruntime for common datasets, which shows prediction times on the breast cancer and digits datasets.
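
For readers who want to reproduce the comparison without mlprodict, the serialized model can be fed to onnxruntime directly. This is a minimal sketch, assuming onnxruntime is installed and that onxb and X come from the scripts above; the exact output names depend on the skl2onnx conversion:

import onnxruntime as ort

# create an inference session from the serialized ONNX graph
# (recent onnxruntime versions also expect a providers argument,
# e.g. providers=['CPUExecutionProvider'])
sess = ort.InferenceSession(onxb)

# predict one observation; the converted classifier returns
# the label and the class probabilities
label, proba = sess.run(None, {'X': X[:1]})
print(label, proba)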