.. blogpost::
    :title: RandomForestClassifier - prediction for one observation
    :keywords: scikit-learn, py-spy, benchmark, one-off prediction
    :date: 2019-12-04
    :categories: benchmark

    I was meeting with Olivier Grisel this morning and
    we were wondering why :epkg:`scikit-learn` was slow
    to compute the prediction of a random forest for
    one observation compared to what :epkg:`onnxruntime` does,
    and more specifically to some optimized C++ code inspired
    by :epkg:`onnxruntime`.
    We used :epkg:`py-spy` and wrote the following script:

    ::

        import numpy as np
        from joblib import Memory
        import sklearn
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from skl2onnx import to_onnx
        from mlprodict.onnxrt import OnnxInference

        m = Memory(location="c:\\temp", mmap_mode='r')

        @m.cache
        def make_model():
            X, y = make_classification(
                n_features=2, n_redundant=0, n_informative=2,
                random_state=1, n_clusters_per_class=1, n_samples=1000)
            rf = RandomForestClassifier(
                max_depth=5, n_estimators=100, max_features=1)
            rf.fit(X, y)
            onx = to_onnx(rf, X.astype(np.float32))
            onxb = onx.SerializeToString()
            return rf, X, onxb

        rf, X, onxb = make_model()
        X = X.astype(np.float32)
        oinf = OnnxInference(onxb, runtime="python")
        oinf_ort = OnnxInference(onxb, runtime="onnxruntime1")

        def f1():
            for _ in range(0, 10):
                for i in range(0, X.shape[0]):
                    y = oinf.run({'X': X[i: i+1]})

        def f2():
            with sklearn.config_context(assume_finite=True):
                for _ in range(0, 10):
                    for i in range(0, X.shape[0]):
                        rf.predict(X[i: i+1])

        f1()  # C++ code
        f2()  # scikit-learn

    The script is run with the following command line:

    ::

        py-spy record --native --function --rate=10 -o demo.svg -- python demo.py

    The following image is a snapshot of the final result:

    .. image:: rf1.png

    It shows that, on my machine, :epkg:`scikit-learn` is slowed down
    by :epkg:`joblib`, whose overhead brings no benefit for such a small
    input. ``check_is_fitted`` also takes some time, even though
    ``sklearn.config_context(assume_finite=True)`` was used.
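
    One way to check how much of the single-row cost comes from the
    :epkg:`joblib` dispatch rather than from the trees themselves is to
    average the per-tree probabilities in a plain Python loop and time
    that loop separately. The sketch below is only an illustration: it
    assumes the forest's probabilities are the plain average of its
    trees' probabilities (the last lines compute both so they can be
    compared), ``manual_predict_proba`` is a hypothetical helper, and
    every tree still validates its input.

    ::

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier

        X, y = make_classification(
            n_features=2, n_redundant=0, n_informative=2,
            random_state=1, n_clusters_per_class=1, n_samples=1000)
        X = X.astype(np.float32)
        rf = RandomForestClassifier(
            max_depth=5, n_estimators=100, max_features=1, random_state=0)
        rf.fit(X, y)

        def manual_predict_proba(forest, x):
            # Hypothetical helper: sequential loop over the fitted trees,
            # without going through joblib.Parallel.
            total = np.zeros((x.shape[0], forest.n_classes_))
            for tree in forest.estimators_:
                total += tree.predict_proba(x)
            return total / len(forest.estimators_)

        one = X[:1]
        proba_manual = manual_predict_proba(rf, one)
        proba_forest = rf.predict_proba(one)

    Timing ``manual_predict_proba(rf, one)`` against
    ``rf.predict_proba(one)`` with ``timeit`` would then give a rough idea
    of the dispatch overhead on a given machine.
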
    We then modified the script to compare this outcome to what
    we would get when predicting 1000 observations in a single call.

    ::

        import numpy as np
        from joblib import Memory
        import sklearn
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from skl2onnx import to_onnx
        from mlprodict.onnxrt import OnnxInference

        m = Memory(location="c:\\temp", mmap_mode='r')

        @m.cache
        def make_model():
            X, y = make_classification(
                n_features=2, n_redundant=0, n_informative=2,
                random_state=1, n_clusters_per_class=1, n_samples=1000)
            rf = RandomForestClassifier(
                max_depth=5, n_estimators=100, max_features=1)
            rf.fit(X, y)
            onx = to_onnx(rf, X.astype(np.float32))
            onxb = onx.SerializeToString()
            return rf, X, onxb

        rf, X, onxb = make_model()
        X = X.astype(np.float32)
        oinf = OnnxInference(onxb, runtime="python")

        def f1_1000():
            for _ in range(0, 5000):
                y = oinf.run({'X': X})

        def f2_1000():
            with sklearn.config_context(assume_finite=True):
                for _ in range(0, 5000):
                    rf.predict(X)

        f1_1000()  # C++
        f2_1000()  # scikit-learn

    Both versions spend a similar amount of time in the functions
    which compute the predictions, but :epkg:`joblib` still
    adds some extra time.

    .. image:: rf123.png
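
    The difference between the one-row and the 1000-row experiments can
    be read through a simple cost model: each call pays a fixed overhead
    (validation, dispatch) plus a cost proportional to the number of
    rows. The sketch below only illustrates that assumption;
    ``time_call`` and ``split_cost`` are hypothetical helpers, and the
    timings passed to ``split_cost`` are made-up numbers, not
    measurements from this post.

    ::

        import timeit

        def time_call(fn, arg, number=20):
            # Average wall-clock time, in seconds, of one call fn(arg).
            return timeit.timeit(lambda: fn(arg), number=number) / number

        def split_cost(t_single, t_batch, n_rows):
            # Solve t = overhead + n * per_row from two timings:
            # t_single for a call on 1 row, t_batch for a call on n_rows.
            if n_rows <= 1:
                raise ValueError("n_rows must be greater than 1")
            per_row = (t_batch - t_single) / (n_rows - 1)
            overhead = t_single - per_row
            return overhead, per_row

        # Made-up timings in seconds, for illustration only.
        overhead, per_row = split_cost(
            t_single=1.0e-3, t_batch=2.0e-3, n_rows=1001)

    With the objects of the scripts above, ``time_call(rf.predict, X[:1])``
    and ``time_call(rf.predict, X)`` would provide the two timings to
    feed into ``split_cost``.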

    Figures about other classifiers can be found
    at `Prediction time scikit-learn / onnxruntime for common datasets
    <http://www.xavierdupre.fr/app/_benchmarks/helpsphinx/
    onnx/onnxruntime_datasets_num.html>`_. It shows the prediction times
    on the *breast cancer* and *digits* datasets.