module onnx_tools.optim.sklearn_helper#
Short summary#
module mlprodict.onnx_tools.optim.sklearn_helper
Helpers to manipulate scikit-learn models.
Functions#
function | truncated documentation
---|---
enumerate_fitted_arrays | Enumerate all fitted arrays included in a scikit-learn object.
enumerate_pipeline_models | Enumerates all the models within a pipeline.
inspect_sklearn_model | Inspects a scikit-learn model and produces a few figures that try to represent its complexity.
max_depth | Retrieves the max depth assuming the estimator is a decision tree.
pairwise_array_distances | Computes pairwise distances between two lists of arrays l1 and l2. The distance is 1e9 if shapes are not equal.
set_n_jobs | Looks into the model signature and adds the parameter n_jobs if available. The function does not overwrite the parameter.
Documentation#
Helpers to manipulate scikit-learn models.
- mlprodict.onnx_tools.optim.sklearn_helper.enumerate_fitted_arrays(model)#
Enumerate all fitted arrays included in a scikit-learn object.
- Parameters:
model – scikit-learn object
- Returns:
enumerator
One example:
<<<
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from mlprodict.onnx_tools.optim.sklearn_helper import enumerate_fitted_arrays

iris = load_iris()
X, y = iris.data, iris.target
X_train, __, y_train, _ = train_test_split(X, y, random_state=11)
clr = make_pipeline(PCA(n_components=2),
                    LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)

for a in enumerate_fitted_arrays(clr):
    print(a)
>>>
((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'mean_', array([5.812, 3.065, 3.7 , 1.178])), None)
((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'components_', array([[ 0.354, -0.085, 0.858, 0.362], [ 0.663, 0.725, -0.163, -0.092]])), None)
((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'explained_variance_', array([4.101, 0.254])), None)
((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'explained_variance_ratio_', array([0.92 , 0.057])), None)
((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'singular_values_', array([21.335, 5.309])), None)
((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'classes_', array([0, 1, 2])), None)
((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'coef_', array([[-2.113, 1.268], [ 0.261, -1.894], [ 2.118, -0.467]])), None)
((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'intercept_', array([-1.66 , -0.764, -2.73 ])), None)
((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'n_iter_', array([6, 4, 6], dtype=int32)), None)
- mlprodict.onnx_tools.optim.sklearn_helper.enumerate_pipeline_models(pipe, coor=None, vs=None)#
Enumerates all the models within a pipeline.
- Parameters:
pipe – scikit-learn pipeline
coor – current coordinate
vs – subset of variables for the model, None for all
- Returns:
iterator on models
tuple(coordinate, model)
Example:
<<<
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from mlprodict.onnx_tools.optim.sklearn_helper import enumerate_pipeline_models

iris = load_iris()
X, y = iris.data, iris.target
X_train, __, y_train, _ = train_test_split(X, y, random_state=11)
clr = make_pipeline(PCA(n_components=2),
                    LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)

for a in enumerate_pipeline_models(clr):
    print(a)
>>>
((0,), Pipeline(steps=[('pca', PCA(n_components=2)), ('logisticregression', LogisticRegression(solver='liblinear'))]), None)
((0, 0), PCA(n_components=2), None)
((0, 1), LogisticRegression(solver='liblinear'), None)
- mlprodict.onnx_tools.optim.sklearn_helper.inspect_sklearn_model(model, recursive=True)#
Inspects a scikit-learn model and produces a few figures that try to represent its complexity.
- Parameters:
model – model
recursive – recursive look
- Returns:
dictionary
<<<
import pprint
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from mlprodict.onnx_tools.optim.sklearn_helper import inspect_sklearn_model

iris = load_iris()
X = iris.data
y = iris.target
lr = LogisticRegression()
lr.fit(X, y)
pprint.pprint((lr, inspect_sklearn_model(lr)))

iris = load_iris()
X = iris.data
y = iris.target
rf = RandomForestClassifier()
rf.fit(X, y)
pprint.pprint((rf, inspect_sklearn_model(rf)))
>>>
(LogisticRegression(), {'ncoef': 3, 'nlin': 1, 'nop': 1})
(RandomForestClassifier(), {'max_depth': 10, 'nnodes': 1724, 'nop': 101, 'ntrees': 100})
- mlprodict.onnx_tools.optim.sklearn_helper.max_depth(estimator)#
Retrieves the max depth assuming the estimator is a decision tree.
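For illustration, the depth of a fitted scikit-learn decision tree can be read off its structure arrays `children_left` and `children_right` (exposed on the estimator's `tree_` attribute), where -1 marks a leaf. The sketch below is an assumption about how such a depth could be computed, counting the root as depth 1; it is not the mlprodict source.

```python
# Hypothetical sketch (not the mlprodict implementation): compute the
# maximum depth of a tree given the children_left / children_right
# arrays used by scikit-learn's tree structure, -1 marking a leaf.
def max_depth_from_arrays(children_left, children_right, node=0):
    if children_left[node] == -1:  # leaf node, counted as depth 1
        return 1
    left = max_depth_from_arrays(children_left, children_right,
                                 children_left[node])
    right = max_depth_from_arrays(children_left, children_right,
                                  children_right[node])
    return 1 + max(left, right)

# A tiny tree: root (0) -> child 1 (leaf) and child 2; 2 -> leaves 3, 4.
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
print(max_depth_from_arrays(children_left, children_right))  # 3
```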
- mlprodict.onnx_tools.optim.sklearn_helper.pairwise_array_distances(l1, l2, metric='l1med')#
Computes pairwise distances between two lists of arrays l1 and l2. The distance is 1e9 if shapes are not equal.
- Parameters:
l1 – first list of arrays
l2 – second list of arrays
metric – metric to use; ‘l1med’ computes the average absolute error divided by the absolute median
- Returns:
matrix
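The description above can be sketched in plain Python. The `l1med` body below is an assumption drawn from the docstring (average absolute error divided by the absolute median, 1e9 when shapes differ), not the actual mlprodict implementation.

```python
# Hypothetical sketch of the 'l1med' metric described above (assumed
# behaviour, not the mlprodict source).
def l1med(a, b):
    if len(a) != len(b):
        return 1e9  # shapes are not equal
    mean_abs_err = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    med = sorted(abs(v) for v in b)[len(b) // 2]  # absolute median
    return mean_abs_err / med if med else mean_abs_err

def pairwise_array_distances(l1, l2):
    # Returns a len(l1) x len(l2) matrix of distances.
    return [[l1med(a, b) for b in l2] for a in l1]

dist = pairwise_array_distances([[1.0, 2.0]], [[1.0, 2.0], [0.0]])
print(dist)  # [[0.0, 1000000000.0]]
```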
- mlprodict.onnx_tools.optim.sklearn_helper.set_n_jobs(model, params, n_jobs=None)#
Looks into the model signature and adds the parameter n_jobs if available. The function does not overwrite the parameter.
- Parameters:
model – model class
params – current set of parameters
n_jobs – value to set; if unspecified or 0, the number of CPUs is used
- Returns:
new set of parameters
On this machine, the default value is the following.
<<<
import multiprocessing
print(multiprocessing.cpu_count())
>>>
8
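The behaviour described above can be sketched with `inspect.signature`. Both `FakeModel` and this `set_n_jobs` body are hypothetical stand-ins for illustration, not the mlprodict implementation.

```python
import inspect
import multiprocessing

# Hypothetical stand-in for a scikit-learn estimator class.
class FakeModel:
    def __init__(self, alpha=1.0, n_jobs=None):
        self.alpha = alpha
        self.n_jobs = n_jobs

# Assumed sketch of set_n_jobs: add n_jobs only when the model's
# signature accepts it and the caller has not already set it.
def set_n_jobs(model, params, n_jobs=None):
    params = dict(params)
    sig = inspect.signature(model.__init__)
    if "n_jobs" in sig.parameters and "n_jobs" not in params:
        params["n_jobs"] = n_jobs or multiprocessing.cpu_count()
    return params

print(set_n_jobs(FakeModel, {"alpha": 0.5}))
print(set_n_jobs(FakeModel, {"n_jobs": 2}))  # existing value kept
```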