module onnx_tools.optim.sklearn_helper

Short summary

module mlprodict.onnx_tools.optim.sklearn_helper

Helpers to manipulate scikit-learn models.

source on GitHub

Functions

function – truncated documentation

  • enumerate_fitted_arrays – Enumerates all fitted arrays included in a scikit-learn object.

  • enumerate_pipeline_models – Enumerates all the models within a pipeline.

  • inspect_sklearn_model – Inspects a scikit-learn model and produces a few figures that try to represent its complexity.

  • max_depth – Retrieves the max depth assuming the estimator is a decision tree.

  • pairwise_array_distances – Computes pairwise distances between two lists of arrays l1 and l2. The distance is 1e9 if shapes are not equal.

  • set_n_jobs – Looks into the model signature and adds the parameter n_jobs if available. The function does not overwrite the parameter.

Documentation

Helpers to manipulate scikit-learn models.

mlprodict.onnx_tools.optim.sklearn_helper.enumerate_fitted_arrays(model)

Enumerates all fitted arrays included in a scikit-learn object.

Parameters:

  • model – scikit-learn object

Returns:

enumerator

One example:

<<<

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from mlprodict.onnx_tools.optim.sklearn_helper import enumerate_fitted_arrays

iris = load_iris()
X, y = iris.data, iris.target
X_train, __, y_train, _ = train_test_split(X, y, random_state=11)
clr = make_pipeline(PCA(n_components=2),
                    LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)

for a in enumerate_fitted_arrays(clr):
    print(a)

>>>

    ((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'mean_', array([5.812, 3.065, 3.7  , 1.178])), None)
    ((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'components_', array([[ 0.354, -0.085,  0.858,  0.362],
           [ 0.663,  0.725, -0.163, -0.092]])), None)
    ((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'explained_variance_', array([4.101, 0.254])), None)
    ((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'explained_variance_ratio_', array([0.92 , 0.057])), None)
    ((0, 0), PCA(n_components=2), PCA(n_components=2), (PCA(n_components=2), 'singular_values_', array([21.335,  5.309])), None)
    ((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'classes_', array([0, 1, 2])), None)
    ((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'coef_', array([[-2.113,  1.268],
           [ 0.261, -1.894],
           [ 2.118, -0.467]])), None)
    ((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'intercept_', array([-1.66 , -0.764, -2.73 ])), None)
    ((0, 1), LogisticRegression(solver='liblinear'), LogisticRegression(solver='liblinear'), (LogisticRegression(solver='liblinear'), 'n_iter_', array([6, 4, 6], dtype=int32)), None)

mlprodict.onnx_tools.optim.sklearn_helper.enumerate_pipeline_models(pipe, coor=None, vs=None)

Enumerates all the models within a pipeline.

Parameters:
  • pipe – scikit-learn pipeline

  • coor – current coordinate

  • vs – subset of variables for the model, None for all

Returns:

iterator over tuples (coordinate, model, vs)

Example:

<<<

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from mlprodict.onnx_tools.optim.sklearn_helper import enumerate_pipeline_models

iris = load_iris()
X, y = iris.data, iris.target
X_train, __, y_train, _ = train_test_split(X, y, random_state=11)
clr = make_pipeline(PCA(n_components=2),
                    LogisticRegression(solver="liblinear"))
clr.fit(X_train, y_train)

for a in enumerate_pipeline_models(clr):
    print(a)

>>>

    ((0,), Pipeline(steps=[('pca', PCA(n_components=2)),
                    ('logisticregression', LogisticRegression(solver='liblinear'))]), None)
    ((0, 0), PCA(n_components=2), None)
    ((0, 1), LogisticRegression(solver='liblinear'), None)

mlprodict.onnx_tools.optim.sklearn_helper.inspect_sklearn_model(model, recursive=True)

Inspects a scikit-learn model and produces a few figures that try to represent its complexity.

Parameters:
  • model – model

  • recursive – if True, looks into nested models recursively

Returns:

dictionary

<<<

import pprint
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from mlprodict.onnx_tools.optim.sklearn_helper import inspect_sklearn_model

iris = load_iris()
X = iris.data
y = iris.target
lr = LogisticRegression()
lr.fit(X, y)
pprint.pprint((lr, inspect_sklearn_model(lr)))


# Same data, this time with a random forest.
rf = RandomForestClassifier()
rf.fit(X, y)
pprint.pprint((rf, inspect_sklearn_model(rf)))

>>>

    /usr/local/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
    STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
    
    Increase the number of iterations (max_iter) or scale the data as shown in:
        https://scikit-learn.org/stable/modules/preprocessing.html
    Please also refer to the documentation for alternative solver options:
        https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
      n_iter_i = _check_optimize_result(
    (LogisticRegression(), {'ncoef': 3, 'nlin': 1, 'nop': 1})
    (RandomForestClassifier(),
     {'max_depth': 10, 'nnodes': 1724, 'nop': 101, 'ntrees': 100})

mlprodict.onnx_tools.optim.sklearn_helper.max_depth(estimator)

Retrieves the max depth assuming the estimator is a decision tree.
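
As a rough illustration of what "max depth" means for a fitted tree, the sketch below walks the two child arrays that scikit-learn exposes as estimator.tree_.children_left and estimator.tree_.children_right, where -1 marks a leaf. The helper name tree_depth and the hand-written arrays are hypothetical, and this is not necessarily how max_depth itself is implemented.

```python
def tree_depth(children_left, children_right):
    """Return the depth of the deepest node; the root has depth 0.

    Hypothetical helper: the arrays follow scikit-learn's layout,
    where a leaf has -1 as both its left and right child.
    """
    deepest = 0
    stack = [(0, 0)]  # (node index, depth of that node)
    while stack:
        node, depth = stack.pop()
        deepest = max(deepest, depth)
        if children_left[node] != -1:  # internal node: visit both children
            stack.append((children_left[node], depth + 1))
            stack.append((children_right[node], depth + 1))
    return deepest


# Hand-written tree: node 0 splits into 1 (leaf) and 2; node 2 splits into 3 and 4.
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
print(tree_depth(children_left, children_right))
```

An explicit stack is used instead of recursion so that very deep trees do not hit Python's recursion limit.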

mlprodict.onnx_tools.optim.sklearn_helper.pairwise_array_distances(l1, l2, metric='l1med')

Computes pairwise distances between two lists of arrays l1 and l2. The distance is 1e9 if shapes are not equal.

Parameters:
  • l1 – first list of arrays

  • l2 – second list of arrays

  • metric – metric to use; ‘l1med’ computes the average absolute error divided by the absolute median

Returns:

matrix
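
The description above leaves some details open, so the following is only one plausible reading of the ‘l1med’ metric, written with NumPy. It assumes the scale is the median of the absolute values of the first array (replaced by 1 when it is 0); the names l1med and pairwise_sketch are hypothetical and the actual implementation may differ.

```python
import numpy as np


def l1med(a, b):
    """Mean absolute error between a and b, divided by the median of |a|."""
    # Assumption: the "absolute median" is taken over the first array.
    scale = float(np.median(np.abs(a)))
    if scale == 0:
        scale = 1.0  # assumption: avoid dividing by zero
    return float(np.mean(np.abs(a - b)) / scale)


def pairwise_sketch(l1, l2):
    """Distance matrix of shape (len(l1), len(l2)); 1e9 where shapes differ."""
    dist = np.full((len(l1), len(l2)), 1e9)
    for i, a in enumerate(l1):
        for j, b in enumerate(l2):
            if a.shape == b.shape:
                dist[i, j] = l1med(a, b)
    return dist


l1 = [np.array([1.0, 2.0, 3.0]), np.zeros((2, 2))]
l2 = [np.array([1.0, 2.0, 5.0]), np.ones((2, 2))]
dist = pairwise_sketch(l1, l2)
print(dist)
```

Pairs with mismatched shapes keep the sentinel value 1e9, so they can never be selected as the closest match.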

mlprodict.onnx_tools.optim.sklearn_helper.set_n_jobs(model, params, n_jobs=None)

Looks into the model signature and adds the parameter n_jobs if available. The function does not overwrite the parameter.

Parameters:
  • model – model class

  • params – current set of parameters

  • n_jobs – value to set: n_jobs if specified, otherwise (None or 0) the number of CPUs

Returns:

new set of parameters

On this machine, the default value is the following.

<<<

import multiprocessing
print(multiprocessing.cpu_count())

>>>

    8
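
The behaviour described for set_n_jobs can be sketched with the standard library alone. This is a hypothetical re-implementation under stated assumptions (in particular, that None or 0 falls back to the CPU count), not the library's actual code; set_n_jobs_sketch, WithJobs and WithoutJobs are illustrative names.

```python
import inspect
import multiprocessing


def set_n_jobs_sketch(model, params, n_jobs=None):
    """Return params with n_jobs added when the model class accepts it."""
    # Never overwrite a value the caller already chose.
    if params is not None and "n_jobs" in params:
        return params
    # Only add the parameter if the constructor has an n_jobs argument.
    if "n_jobs" not in inspect.signature(model.__init__).parameters:
        return params
    new_params = dict(params or {})
    # Assumption: None or 0 means "use every CPU on this machine".
    new_params["n_jobs"] = n_jobs or multiprocessing.cpu_count()
    return new_params


class WithJobs:
    """Stand-in for an estimator class whose constructor takes n_jobs."""

    def __init__(self, n_jobs=None, alpha=1.0):
        self.n_jobs = n_jobs
        self.alpha = alpha


class WithoutJobs:
    """Stand-in for an estimator class without an n_jobs parameter."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha


print(set_n_jobs_sketch(WithJobs, {"alpha": 2.0}, n_jobs=3))
print(set_n_jobs_sketch(WithoutJobs, {"alpha": 2.0}, n_jobs=3))
```

Copying the dictionary before adding the key keeps the caller's params untouched.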
