mlinsights
http://www.xavierdupre.fr/app/mlinsights/helpsphinx//blog/main_0000.html
blog associated with mlinsights

scikit-learn internal API
http://www.xavierdupre.fr/app/mlinsights/helpsphinx//blog/2020/20200902_api.html
The signature of method `impurity_improvement
<https://github.com/scikitlearn/scikitlearn/blob/master/
sklearn/tree/_criterion.pxd#L65>`_ will change in version
0.24. Supporting two versions of scikit-learn is usually easy,
even for a method overloaded in a class, except that this method is implemented
in :epkg:`cython`. The method must be overloaded the same way
with the same signature. The way it was handled is implemented
in PR `88 <https://github.com/sdpython/mlinsights/pull/88>`_.
The best would be to include both signatures but only one of
them can compile. I did not find any good solution to that.
The code compiles whatever the version of scikit-learn is, but the compiled
module only works with the version of
:epkg:`scikitlearn` it was compiled against.
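One possible way to cope with this, sketched below under assumptions, is to choose at build time which variant of the Cython source gets compiled, based on the installed version. The helper names and the two file names are hypothetical, and this is not necessarily what PR 88 does.

```python
def parse_version(version):
    """Converts '0.24.1' or '0.24.dev0' into a tuple of leading integers."""
    numbers = []
    for part in version.split("."):
        digits = ""
        for char in part:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            # stops at the first non numeric part such as 'dev0'
            break
        numbers.append(int(digits))
    return tuple(numbers)


def select_criterion_source(sklearn_version):
    """Returns the name of the (hypothetical) .pyx file to compile."""
    if parse_version(sklearn_version) >= (0, 24):
        return "criterion_sklearn_024.pyx"
    return "criterion_sklearn_023.pyx"


print(select_criterion_source("0.23.2"))
print(select_criterion_source("0.24.dev0"))
```

In a real build script, the argument would typically be ``sklearn.__version__``. The limitation described above remains: the compiled module matches only the version seen at build time.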
20200902

Nogil, numpy, cython
http://www.xavierdupre.fr/app/mlinsights/helpsphinx//blog/2019/20190325_nogil.html
I had to implement a custom criterion to optimize
a decision tree and I wanted to leverage :epkg:`scikitlearn`
instead of rewriting my own. Version 0.21 of :epkg:`scikitlearn`
introduced some changes in the API which make it possible
to overload an existing criterion and replace part of its logic
with another one: `_criterion.pyx
<https://github.com/scikitlearn/scikitlearn/blob/master/sklearn/tree/_criterion.pyx>`_.
The purpose was to show that a fast implementation requires
some tricks: see :ref:`piecewiselinearregressioncriterionrst`, and
`piecewise_tree_regression_criterion.pyx
<https://github.com/sdpython/mlinsights/blob/master/mlinsights/mlmodel/piecewise_tree_regression_criterion.pyx>`_,
`piecewise_tree_regression_criterion_fast.pyx
<https://github.com/sdpython/mlinsights/blob/master/mlinsights/mlmodel/piecewise_tree_regression_criterion_fast.pyx>`_
for the code. Other than that, every function to overload is marked as
:epkg:`nogil`. A function or method marked as *nogil* runs without
holding the :epkg:`GIL` (see also :epkg:`PEP0311`),
which means no :epkg:`python` object can be created in that method.
In fact, no :epkg:`python` function can be called inside a :epkg:`Cython`
method declared as *nogil*. The issue with that is that
no :epkg:`numpy` function can be called either.
My goal was to replace the implementation of the decision tree
criterion by something optimizing a linear regression, basically
something close to function :epkg:`numpy:linalg:lstsq` but that's
inside :epkg:`numpy` so unavailable in a *nogil* method.
I needed to use the inner API from :epkg:`BLAS` or :epkg:`LAPACK`,
available in :epkg:`cython` through
`cython_blas <https://docs.scipy.org/doc/scipy/reference/linalg.cython_blas.html>`_
(matrix operations) and
`cython_lapack <https://docs.scipy.org/doc/scipy/reference/linalg.cython_lapack.html>`_
(more complex matrix operations).
It is fast but not really well documented.
I needed to use function `dgelss
<http://www.netlib.org/lapack/explorehtml/d9/d4e/dgelss_8f.html>`_
(also exposed by scipy as `scipy.linalg.lapack.dgelss
<https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lapack.dgelss.html>`_),
whose documentation is available through :epkg:`Lapack documentation`.
Its signature can be found at `cython_lapack_signatures.txt
<https://github.com/scipy/scipy/blob/master/scipy/linalg/cython_lapack_signatures.txt>`_
and is the following:
::

    cdef void dgelss(int *m, int *n, int *nrhs, double *a, int *lda, double *b, int *ldb,
                     double *s, double *rcond, int *rank,
                     double *work, int *lwork, int *info) nogil

It failed many times before I got it right.
Most of the time, :epkg:`python` just crashes without telling me
what the issue is. I had to use many ``printf`` in the :epkg:`cython`
code to get it right (remember, no python call, so no *print* function).
These functions do not allocate anything: every buffer they need
must be allocated first. The documentation gives some recommendations
about the optimal buffer size. The function usually modifies its inputs,
so they must be copied first if the user wants to reuse them later.
I finally implemented :func:`dgelss <mlinsights.mlmodel.direct_blas_lapack.dgelss>`,
also available on github: `direct_blas_lapack.pyx
<https://github.com/sdpython/mlinsights/blob/master/src/mlinsights/mlmodel/direct_blas_lapack.pyx>`_.
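For reference, the least squares problem the criterion has to solve can be written with numpy outside any *nogil* section. The sketch below is only an illustration on made-up data: it relies on ``numpy.linalg.lstsq``, not on the Cython wrapper, and the helper name ``linear_fit_sse`` is hypothetical.

```python
import numpy as np


def linear_fit_sse(X, y):
    """Fits y ~ X @ w + b by least squares, returns (coefficients, sse)."""
    # a column of ones is appended so that the last coefficient is the intercept
    X1 = np.column_stack([X, np.ones(X.shape[0])])
    coef = np.linalg.lstsq(X1, y, rcond=None)[0]
    residuals = y - X1 @ coef
    # the sum of squared errors is what a regression criterion would minimize
    return coef, float(residuals @ residuals)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 2 * x + 1
coef, sse = linear_fit_sse(X, y)
print(coef, sse)
```

Unlike the LAPACK routine, ``numpy.linalg.lstsq`` allocates its own buffers and leaves its inputs untouched, which is exactly the convenience lost in a *nogil* method.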
20190325

Faster Polynomial Features
http://www.xavierdupre.fr/app/mlinsights/helpsphinx//blog/2019/20190215_poly.html
The current implementation of
`PolynomialFeatures
<https://scikitlearn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html>`_
in *scikit-learn* computes each new feature
independently and that increases the amount of
data exchanged between *numpy* and *Python*.
The idea of the implementation in
:class:`ExtendedFeatures <mlinsights.mlmodel.extended_features.ExtendedFeatures>`
is to reduce this amount by using broadcast multiplications.
The second optimization consists in transposing the matrix:
dense matrices are stored row by row in memory so
it is faster to multiply two rows than two columns.
See :ref:`fasterpolynomialfeaturesrst`.
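The broadcasting idea can be illustrated with a minimal sketch for degree 2, assuming a dense float matrix. Both function names are hypothetical, this is not the code of ``ExtendedFeatures``, and the transposition trick is left out.

```python
import numpy as np


def poly2_by_column(X):
    """Degree 2 features, one product of two columns per numpy call."""
    n, d = X.shape
    cols = [np.ones(n)] + [X[:, i] for i in range(d)]
    for i in range(d):
        for j in range(i, d):
            cols.append(X[:, i] * X[:, j])
    return np.column_stack(cols)


def poly2_broadcast(X):
    """Same features, one broadcast product per column i."""
    n, d = X.shape
    blocks = [np.ones((n, 1)), X]
    for i in range(d):
        # multiplies column i with columns i..d-1 in a single numpy call
        blocks.append(X[:, i:i + 1] * X[:, i:])
    return np.hstack(blocks)


X = np.arange(12, dtype=float).reshape(4, 3)
slow = poly2_by_column(X)
fast = poly2_broadcast(X)
print(fast.shape)
```

The second version issues ``d`` multiplications instead of ``d * (d + 1) / 2``, which reduces the number of round trips between Python and numpy.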
20190215

Piecewise Linear Regression
http://www.xavierdupre.fr/app/mlinsights/helpsphinx//blog/2019/20190210_piecewise.html
I decided to turn the code of one of the notebooks I wrote about
`Piecewise Linear Regression <http://www.xavierdupre.fr/app/mlstatpy/helpsphinx/notebooks/regression_lineaire.html#regressionlineaireparmorceaux>`_
into something usable and following
the *scikit-learn* API:
:class:`PiecewiseRegression <mlinsights.mlmodel.piecewise_estimator.PiecewiseRegression>`,
illustrated in another notebook :ref:`piecewiselinearregressionrst`.
20190210

Predictable t-SNE
http://www.xavierdupre.fr/app/mlinsights/helpsphinx//blog/2019/20190201_tsne.html
:epkg:`tSNE` is quite an interesting tool to
visualize data on a map but it has one drawback:
results are not reproducible. It is much more powerful
than a :epkg:`PCA` but the results are difficult to
interpret. Based on some experiments, if :epkg:`tSNE`
manages to separate classes, there is a good chance that
a classifier can perform well. Anyhow, I implemented
a regressor which approximates the :epkg:`tSNE` outputs
so that they can be used as features for a further classifier.
I created a notebook :ref:`predictabletsnerst` and a new transform
:class:`PredictableTSNE <mlinsights.sklapi.predictable_tsne.PredictableTSNE>`.
20190201

Pipeline visualization
http://www.xavierdupre.fr/app/mlinsights/helpsphinx//blog/2019/20190201_pipeline.html
:epkg:`scikitlearn` introduced a nice feature to
process columns of mixed types in a single
pipeline which follows the :epkg:`scikitlearn` API:
`ColumnTransformer <https://scikitlearn.org/stable/
modules/generated/sklearn.compose.ColumnTransformer.html>`_,
`FeatureUnion <https://scikitlearn.org/stable/modules/
generated/sklearn.pipeline.FeatureUnion.html>`_ and
`Pipeline <https://scikitlearn.org/stable/modules/
generated/sklearn.pipeline.Pipeline.html>`_. The ideas are not
new but they are finally available in
:epkg:`scikitlearn`.
As *a picture is worth a thousand words*, I tried to
do something similar to what I did for
`nimbusml <https://github.com/Microsoft/NimbusML>`_
to draw a :epkg:`scikitlearn` pipeline.
I ended up implementing function
:ref:`pipeline2dot <mlinsights.plotting.visualize.pipeline2dot>`
which converts a pipeline into the :epkg:`DOT` language
as :epkg:`scikitlearn` does for a decision tree with
`export_graphviz <https://scikitlearn.org/stable/
modules/generated/sklearn.tree.export_graphviz.html>`_.
I created the notebook :ref:`visualizepipelinerst`.
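To give an idea of what a conversion to DOT produces, here is a toy sketch which chains step names in a linear graph. The helper ``steps_to_dot`` is hypothetical: it does not inspect a scikit-learn pipeline and it is not the implementation of ``pipeline2dot``, which has to deal with nested transformers and columns.

```python
def steps_to_dot(step_names):
    """Builds a DOT graph chaining the given step names in order."""
    lines = ["digraph pipeline {", "  rankdir=LR;"]
    # one box per step
    for index, name in enumerate(step_names):
        lines.append('  n%d [label="%s", shape=box];' % (index, name))
    # one edge between two consecutive steps
    for index in range(len(step_names) - 1):
        lines.append("  n%d -> n%d;" % (index, index + 1))
    lines.append("}")
    return "\n".join(lines)


dot = steps_to_dot(["StandardScaler", "PCA", "LogisticRegression"])
print(dot)
```

The resulting string can be rendered with any Graphviz front end.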
20190201

Quantile regression with scikit-learn.
http://www.xavierdupre.fr/app/mlinsights/helpsphinx//blog/2018/20180507_quantile_regression.html
:epkg:`scikitlearn` does not have any quantile regression.
:epkg:`statsmodels` does have one
`QuantReg <http://www.statsmodels.org/dev/generated/statsmodels.regression.quantile_regression.QuantReg.html>`_
but I wanted to try something I did for my teaching,
`Régression Quantile
<http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx3/notebooks/td_note_2017_2.html?highlight=mediane>`_,
based on `Iteratively reweighted least squares
<https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares>`_.
I thought it was a good case study to turn a simple algorithm into
a learner :epkg:`scikitlearn` can reuse in a pipeline.
The notebook :ref:`quantileregressionrst` demonstrates it
and it is implemented in
:class:`QuantileLinearRegression <mlinsights.mlmodel.quantile_regression.QuantileLinearRegression>`.
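The principle of iteratively reweighted least squares can be sketched in a few lines of numpy. This is a simplified illustration, not the code of ``QuantileLinearRegression``: the function name, the fixed number of iterations and the threshold ``eps`` are assumptions made for the example.

```python
import numpy as np


def quantile_irls(X, y, quantile=0.5, n_iter=50, eps=1e-6):
    """Approximates a linear quantile regression, returns [intercept, slopes...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    # starts from the ordinary least squares solution
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X1 @ beta
        # weights turn the squared error into the pinball loss:
        # w * r^2 = quantile * |r| if r >= 0, (1 - quantile) * |r| otherwise
        w = np.where(r >= 0, quantile, 1.0 - quantile) / np.maximum(np.abs(r), eps)
        sw = np.sqrt(w)
        # weighted least squares solved as a rescaled least squares problem
        beta = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)[0]
    return beta


x = np.arange(11.0)
y = 2 * x + 1
y[5] += 50  # a single outlier
beta = quantile_irls(x, y)
print(beta)
```

On this toy dataset, the median regression ignores the outlier whereas an ordinary least squares fit would shift the intercept.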
20180507

Function to get insights on machine learned models
http://www.xavierdupre.fr/app/mlinsights/helpsphinx//blog/2017/20171018_first_day.html
Machine learned models are black boxes.
The module tries to implement some functions
to get insights on machine learned models.
20171118