.. blogpost::
    :title: Parallelization of Random Forest predictions
    :keywords: scikit-learn, parallelization, Random Forest
    :date: 2020-11-27
    :categories: runtime

    I've been struggling to understand why the first implementation of
    TreeEnsemble could not be as fast as the *scikit-learn* implementation
    of a random forest with 100,000 observations or more, 100 trees and a
    depth >= 10. The only difference was that the computation was
    parallelized over trees and not over observations. These observations
    are benchmarked in :ref:`l-example-tree-ensemble-reg-bench`
    (:ref:`l-example-tree-ensemble-cls-bench-multi` for the multiclass
    version).

    * `forest.py `_
    * `tree.pyx `_

    Parallelizing by tree requires saving the output of every tree for
    every observation, which means the computation needs an additional
    buffer (at least one per thread) to store the trees' outputs.
    However, it is approximately two to three times faster to do it that
    way than to parallelize over observations, even though the
    computational cost is the same in both cases (a sketch of both
    strategies follows at the end of this post). The only explanation
    would be a better use of the caches (L1, L2, L3) when the computation
    is parallelized per tree. The answer is probably hidden in that book.

    * `What Every Programmer Should Know About Memory `_

    The next investigation should be a study of the difference between a
    tree described as an array of nodes and a structure of arrays where
    every node field gets its own array (see the second sketch below).

    * `Performance Optimization Strategies for WRF Physics Schemes Used in Weather Modeling `_
    * `Memory Layout Transformations `_

    Other readings:

    * `Demystifying The Restrict Keyword `_
    * `Aliasing `_
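    To make the comparison concrete, here is a minimal Python sketch of
    the two strategies, run over the fitted trees of a
    ``RandomForestRegressor`` with a thread pool. It only illustrates the
    memory-access pattern: the real implementations are written in
    C++/Cython with OpenMP, and the function names below are mine, not
    part of any library.

    .. code-block:: python

        import numpy as np
        from concurrent.futures import ThreadPoolExecutor
        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor

        def predict_by_observation(trees, X, n_threads=4):
            # Every thread owns a chunk of rows and walks all the trees
            # for it: no extra buffer, results are written in place.
            result = np.zeros(X.shape[0])
            chunks = np.array_split(np.arange(X.shape[0]), n_threads)

            def worker(rows):
                for tree in trees:
                    result[rows] += tree.predict(X[rows])

            with ThreadPoolExecutor(n_threads) as ex:
                list(ex.map(worker, chunks))
            return result / len(trees)

        def predict_by_tree(trees, X, n_threads=4):
            # Every thread owns a subset of trees and needs its own
            # buffer of size n_observations, summed at the end.
            groups = np.array_split(np.arange(len(trees)), n_threads)

            def worker(indices):
                buffer = np.zeros(X.shape[0])  # one buffer per thread
                for i in indices:
                    buffer += trees[i].predict(X)
                return buffer

            with ThreadPoolExecutor(n_threads) as ex:
                buffers = list(ex.map(worker, groups))
            return sum(buffers) / len(trees)

        X, y = make_regression(n_samples=10000, n_features=10, random_state=0)
        rf = RandomForestRegressor(n_estimators=100, max_depth=10).fit(X, y)
        p1 = predict_by_observation(rf.estimators_, X)
        p2 = predict_by_tree(rf.estimators_, X)
        assert np.allclose(p1, p2)
        assert np.allclose(p1, rf.predict(X))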
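    The array-of-nodes layout versus the structure-of-arrays layout can
    also be sketched in a few lines. The fields below are a simplified
    version of the node record *scikit-learn* defines in ``_tree.pyx``,
    and the ``-1`` convention for leaves mirrors the one used there; the
    functions are again hypothetical illustrations.

    .. code-block:: python

        import numpy as np

        # Array of structures (AoS): one record per node, all fields of
        # a node contiguous in memory.
        node_dtype = np.dtype([
            ("left_child", np.int64),
            ("right_child", np.int64),
            ("feature", np.int64),
            ("threshold", np.float64),
            ("value", np.float64),
        ])

        def predict_aos(nodes, x):
            i = 0
            while nodes[i]["left_child"] != -1:  # -1 marks a leaf
                if x[nodes[i]["feature"]] <= nodes[i]["threshold"]:
                    i = nodes[i]["left_child"]
                else:
                    i = nodes[i]["right_child"]
            return nodes[i]["value"]

        # Structure of arrays (SoA): each field in its own contiguous
        # array, so a traversal that mostly reads `feature` and
        # `threshold` touches fewer distinct cache lines.
        def predict_soa(left, right, feature, threshold, value, x):
            i = 0
            while left[i] != -1:
                if x[feature[i]] <= threshold[i]:
                    i = left[i]
                else:
                    i = right[i]
            return value[i]

        # A depth-1 tree: the root splits on feature 0 at 0.5.
        nodes = np.array(
            [(1, 2, 0, 0.5, 0.0), (-1, -1, -1, 0.0, 10.0),
             (-1, -1, -1, 0.0, 20.0)], dtype=node_dtype)
        assert predict_aos(nodes, np.array([0.2])) == 10.0
        # Field views of the structured array give the SoA layout.
        assert predict_soa(nodes["left_child"], nodes["right_child"],
                           nodes["feature"], nodes["threshold"],
                           nodes["value"], np.array([0.7])) == 20.0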