module onnxrt.optim.sklearn_helper
Short summary
module mlprodict.onnxrt.optim.sklearn_helper
Helpers to manipulate scikit-learn models.
Functions
function | truncated documentation |
---|---|
enumerate_fitted_arrays | Enumerates all fitted arrays included in a scikit-learn object. |
enumerate_pipeline_models | Enumerates all the models within a pipeline. |
inspect_sklearn_model | Inspects a scikit-learn model and produces some figures that try to represent its complexity. |
max_depth | Retrieves the max depth assuming the estimator is a decision tree. |
pairwise_array_distances | Computes pairwise distances between two lists of arrays l1 and l2. The distance is 1e9 if shapes are not equal. |
set_n_jobs | Looks into the model signature and adds parameter n_jobs if available. The function does not overwrite the parameter. |
Documentation
mlprodict.onnxrt.optim.sklearn_helper.enumerate_fitted_arrays(model)
Enumerates all fitted arrays included in a scikit-learn object.
- Parameters
model – scikit-learn object
- Returns
enumerator
One example:
<<<
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from mlprodict.onnxrt.optim.sklearn_helper import enumerate_fitted_arrays

iris = load_iris()
X, y = iris.data, iris.target
X_train, __, y_train, _ = train_test_split(X, y, random_state=11)
clr = make_pipeline(PCA(n_components=2), LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)

for a in enumerate_fitted_arrays(clr):
    print(a)
>>>
((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'mean_', array([5.812, 3.065, 3.7 , 1.178])), None)
((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'components_', array([[ 0.354, -0.085, 0.858, 0.362], [ 0.663, 0.725, -0.163, -0.092]])), None)
((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'explained_variance_', array([4.101, 0.254])), None)
((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'explained_variance_ratio_', array([0.92 , 0.057])), None)
((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'singular_values_', array([21.335, 5.309])), None)
((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'classes_', array([0, 1, 2])), None)
((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'coef_', array([[-2.113, 1.268], [ 0.261, -1.894], [ 2.118, -0.467]])), None)
((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'intercept_', array([-1.66 , -0.764, -2.73 ])), None)
((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'n_iter_', array([6], dtype=int32)), None)
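The output above only lists attributes whose names end with an underscore, which is scikit-learn's convention for values learned during fit. The following is a minimal sketch of that filtering idea, not the library's implementation: `FakeScaler` and `fitted_arrays` are hypothetical names introduced for illustration, and the sketch ignores nested models and coordinates.

```python
import numpy as np


class FakeScaler:
    """Hypothetical stand-in for a fitted scikit-learn estimator."""

    def __init__(self):
        self.with_mean = True                # hyperparameter, not fitted
        self.mean_ = np.array([5.8, 3.1])    # fitted array
        self.scale_ = np.array([0.8, 0.4])   # fitted array
        self.n_samples_seen_ = 112           # fitted, but not an array


def fitted_arrays(model):
    """Yield (name, array) for public attributes ending in '_' that hold arrays."""
    for name, value in vars(model).items():
        is_fitted_name = name.endswith("_") and not name.startswith("_")
        if is_fitted_name and isinstance(value, np.ndarray):
            yield name, value


names = [name for name, _ in fitted_arrays(FakeScaler())]
print(names)  # ['mean_', 'scale_']
```

The scalar `n_samples_seen_` is skipped because the sketch keeps only `numpy.ndarray` values; whether the real function applies exactly this rule is an assumption.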
mlprodict.onnxrt.optim.sklearn_helper.enumerate_pipeline_models(pipe, coor=None, vs=None)
Enumerates all the models within a pipeline.
- Parameters
pipe – scikit-learn pipeline
coor – current coordinate
vs – subset of variables for the model, None for all
- Returns
iterator on models, each yielded as tuple(coordinate, model, vs)
Example:
<<<
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from mlprodict.onnxrt.optim.sklearn_helper import enumerate_pipeline_models

iris = load_iris()
X, y = iris.data, iris.target
X_train, __, y_train, _ = train_test_split(X, y, random_state=11)
clr = make_pipeline(PCA(n_components=2), LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)

for a in enumerate_pipeline_models(clr):
    print(a)
>>>
((0,), Pipeline(steps=[('pca', PCA(n_components=2)), ('logisticregression', LogisticRegression(solver='liblinear'))]), None)
((0, 0), PCA(n_components=2), None)
((0, 1), LogisticRegression(solver='liblinear'), None)
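The coordinates in this output grow by one index per nesting level: the pipeline is `(0,)` and its steps are `(0, 0)` and `(0, 1)`. A minimal sketch of such a recursive walk could look as follows. `FakePipeline` and `walk_models` are hypothetical names, the sketch only follows a `steps` attribute, and it yields pairs rather than the `(coordinate, model, vs)` triples printed above.

```python
class FakePipeline:
    """Hypothetical stand-in exposing ``steps`` like a scikit-learn Pipeline."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, model) pairs


def walk_models(model, coor=(0,)):
    """Yield (coordinate, model) for the model and every nested step."""
    yield coor, model
    # Only objects exposing ``steps`` are treated as containers in this sketch.
    for index, (_, step) in enumerate(getattr(model, "steps", [])):
        yield from walk_models(step, coor + (index,))


pipe = FakePipeline([("pca", "PCA"), ("logreg", "LogisticRegression")])
coordinates = [coor for coor, _ in walk_models(pipe)]
print(coordinates)  # [(0,), (0, 0), (0, 1)]
```

A real traversal would presumably also need to descend into other containers such as column transformers or feature unions; that part is left out here.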
mlprodict.onnxrt.optim.sklearn_helper.inspect_sklearn_model(model, recursive=True)
Inspects a scikit-learn model and produces some figures that try to represent its complexity.
- Parameters
model – model
recursive – recursive look
- Returns
dictionary
<<<
import pprint
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from mlprodict.onnxrt.optim.sklearn_helper import inspect_sklearn_model

iris = load_iris()
X = iris.data
y = iris.target
lr = LogisticRegression()
lr.fit(X, y)
pprint.pprint((lr, inspect_sklearn_model(lr)))

iris = load_iris()
X = iris.data
y = iris.target
rf = RandomForestClassifier()
rf.fit(X, y)
pprint.pprint((rf, inspect_sklearn_model(rf)))
>>>
/usr/local/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
(LogisticRegression(), {'ncoef': 3, 'nlin': 1, 'nop': 1})
(RandomForestClassifier(), {'max_depth': 9, 'nnodes': 1730, 'nop': 101, 'ntrees': 100})
mlprodict.onnxrt.optim.sklearn_helper.max_depth(estimator)
Retrieves the max depth assuming the estimator is a decision tree.
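This page does not say how the depth is obtained. As an illustration only, a fitted scikit-learn decision tree stores its structure in `tree_.children_left` and `tree_.children_right`, with `-1` marking a missing child, and the depth can be derived from those arrays. The sketch below uses plain lists in place of a fitted tree; `tree_max_depth` is a hypothetical helper, not the function documented here.

```python
TREE_LEAF = -1  # scikit-learn marks a missing child with -1


def tree_max_depth(children_left, children_right, node=0):
    """Return the depth of the subtree rooted at ``node`` (a lone root has depth 0)."""
    left, right = children_left[node], children_right[node]
    if left == TREE_LEAF and right == TREE_LEAF:
        return 0
    depths = [
        tree_max_depth(children_left, children_right, child)
        for child in (left, right)
        if child != TREE_LEAF
    ]
    return 1 + max(depths)


# Node 0 splits into nodes 1 and 2; node 2 splits into nodes 3 and 4.
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
print(tree_max_depth(children_left, children_right))  # 2
```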
mlprodict.onnxrt.optim.sklearn_helper.pairwise_array_distances(l1, l2, metric='l1med')
Computes pairwise distances between two lists of arrays l1 and l2. The distance is 1e9 if shapes are not equal.
- Parameters
l1 – first list of arrays
l2 – second list of arrays
metric – metric to use; ‘l1med’ computes the average absolute error divided by the absolute median
- Returns
matrix
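The ‘l1med’ description can be read as a relative error: the mean absolute difference scaled by a median magnitude. The sketch below is one possible reading, under two stated assumptions: the median is taken over the absolute values of the first array, and a zero median is replaced by 1 to avoid division by zero. `l1med_distance` and `pairwise_l1med` are hypothetical names, not the library's code.

```python
import numpy as np


def l1med_distance(a1, a2):
    """Mean absolute error divided by the median absolute value of ``a1``."""
    if a1.shape != a2.shape:
        return 1e9  # sentinel for mismatched shapes, as documented above
    scale = np.median(np.abs(a1))
    if scale == 0:
        scale = 1.0  # assumption: avoid dividing by zero
    return float(np.mean(np.abs(a1 - a2)) / scale)


def pairwise_l1med(l1, l2):
    """Build the len(l1) x len(l2) matrix of l1med distances."""
    return np.array([[l1med_distance(a1, a2) for a2 in l2] for a1 in l1])


l1 = [np.array([1.0, 2.0, 3.0]), np.array([[1.0, 1.0]])]
l2 = [np.array([1.0, 2.0, 5.0])]
dist = pairwise_l1med(l1, l2)
print(dist)  # first row is about 0.333, second row is 1e9 (shape mismatch)
```

Scaling by the median makes the distance comparable across arrays of very different magnitudes, such as coefficients and intercepts.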
mlprodict.onnxrt.optim.sklearn_helper.set_n_jobs(model, params, n_jobs=None)
Looks into the model signature and adds parameter n_jobs if available. The function does not overwrite the parameter.
- Parameters
model – model class
params – current set of parameters
n_jobs – number of CPUs, or n_jobs if specified or 0
- Returns
new set of parameters
On this machine, the default value is the following.
<<<
import multiprocessing
print(multiprocessing.cpu_count())
>>>
8
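The behaviour described for `set_n_jobs` (inspect the signature, add `n_jobs` only when the model accepts it, never overwrite an existing value) can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: `add_n_jobs`, `WithJobs` and `WithoutJobs` are hypothetical names, and falling back to the CPU count when `n_jobs` is `None` or `0` is an assumed reading of the parameter description.

```python
import inspect
import multiprocessing


def add_n_jobs(model_class, params, n_jobs=None):
    """Return a copy of ``params`` with ``n_jobs`` added when the class accepts it."""
    params = dict(params or {})
    accepts_n_jobs = "n_jobs" in inspect.signature(model_class.__init__).parameters
    if accepts_n_jobs and "n_jobs" not in params:
        # Assumption: None or 0 falls back to the number of CPUs.
        params["n_jobs"] = n_jobs or multiprocessing.cpu_count()
    return params


class WithJobs:  # hypothetical estimator accepting n_jobs
    def __init__(self, n_jobs=None):
        self.n_jobs = n_jobs


class WithoutJobs:  # hypothetical estimator without n_jobs
    def __init__(self, alpha=1.0):
        self.alpha = alpha


print(add_n_jobs(WithJobs, {}, n_jobs=2))             # {'n_jobs': 2}
print(add_n_jobs(WithJobs, {"n_jobs": 4}, n_jobs=2))  # {'n_jobs': 4}
print(add_n_jobs(WithoutJobs, {"alpha": 0.5}))        # {'alpha': 0.5}
```

Returning a copy rather than mutating `params` keeps the caller's dictionary unchanged, which matches the "new set of parameters" wording of the return value.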