2A.ml - Machine learning libraries

Links: notebook, html, PDF, python, slides, GitHub

A non-exhaustive review of machine learning libraries for Python.

from jyquickhelper import add_notebook_menu
add_notebook_menu()
%matplotlib inline
import matplotlib.pyplot as plt

See also Awesome Machine Learning.

Classics

scikit-learn

scikit-learn has set the standard for machine learning in Python.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

logreg = linear_model.LogisticRegression()
logreg.fit(X, Y)

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = 0.02
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

plt.scatter(X[:, 0], X[:, 1], c=Y, lw=1, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
(plot: logistic regression decision boundary on the iris data)
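The same estimator API extends naturally to evaluation. A minimal sketch (an addition, not part of the original notebook) using scikit-learn's train_test_split and accuracy_score on a held-out set:

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# hold out a test set to check generalization
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

clf = linear_model.LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(accuracy)
```

Every scikit-learn estimator follows this same fit/predict contract, which is why the libraries below mimic it.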

XGBoost

XGBoost wins many competitions on Kaggle.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from xgboost import XGBClassifier

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

clf = XGBClassifier()
clf.fit(X, Y)

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = 0.02
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

plt.scatter(X[:, 0], X[:, 1], c=Y, lw=1, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
(plot: XGBoost decision boundary on the iris data)

LightGBM

LightGBM builds models similar to XGBoost's but is faster on most problems (see Lessons Learned From Benchmarking Fast Machine Learning Algorithms).

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from lightgbm import LGBMClassifier

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

clf = LGBMClassifier()
clf.fit(X, Y)

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = 0.02
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

plt.scatter(X[:, 0], X[:, 1], c=Y, lw=1, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
(plot: LightGBM decision boundary on the iris data)

statsmodels

statsmodels covers linear machine learning and time series. It applies easily to dataframes whose index is of type DatetimeIndex.

import numpy as np
from scipy import stats
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.api import qqplot
from datetime import datetime

dta = sm.datasets.sunspots.load_pandas().data
dta = dta.set_index(pandas.DatetimeIndex(dta["YEAR"].apply(lambda d: datetime(year=int(d), month=1, day=1))))
del dta["YEAR"]
dta.plot(figsize=(12,8))
(plot: yearly sunspot activity)
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(dta.values.squeeze(), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(dta, lags=40, ax=ax2)
(plot: ACF and PACF of the sunspot series)
arma_mod20 = sm.tsa.ARMA(dta, (2,0)).fit()
arma_mod20.params
const                49.659343
ar.L1.SUNACTIVITY     1.390656
ar.L2.SUNACTIVITY    -0.688571
dtype: float64
arma_mod30 = sm.tsa.ARMA(dta, (3,0)).fit()
arma_mod20.aic, arma_mod20.bic, arma_mod20.hqic
(2622.6363380639814, 2637.5697031715722, 2628.6067259092274)
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax = arma_mod30.resid.plot(ax=ax)
(plot: residuals of the ARMA(3,0) model)
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
fig = qqplot(arma_mod30.resid, line='q', ax=ax, fit=True)
(plot: QQ-plot of the ARMA(3,0) residuals)
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(arma_mod30.resid.values.squeeze(), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(arma_mod30.resid, lags=40, ax=ax2)
(plot: ACF and PACF of the ARMA(3,0) residuals)
fig, ax = plt.subplots(figsize=(12, 8))
ax = dta[-20:].plot(ax=ax)
fig = arma_mod30.plot_predict(len(dta), len(dta) + 10,
                              dynamic=True, ax=ax, plot_insample=False)
(plot: dynamic forecast beyond the last observations)

Vowpalwabbit

vowpalwabbit, github

Compiling the library on Windows takes a while. It implements a few algorithms very efficiently, notably logistic regression. The documentation is not its strong point.
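Vowpal Wabbit reads its own sparse text format, one example per line: a label, an optional importance weight, then namespaces of features after a `|`. A small fragment (feature names are illustrative) for logistic regression with labels in {-1, 1}:

```
1 | height:1.76 weight:70 smokes
-1 | height:1.62 weight:80
1 2.0 | height:1.80 weight:75 smokes
```

Training then runs from the command line with something like `vw data.txt --loss_function logistic -f model.vw` (file names are placeholders); features absent from a line are treated as zero, which is what makes the format efficient on sparse data.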

H2O

H2O, Python documentation.

The library is written in Java and runs on several platforms (Python, R, Java, Spark). The trade-off is that it implements its own dataframes.

Worth watching

lightning

lightning

Large-scale learning; it uses iterators over the data.

Promising, but the module's name is already taken by another package.

polylearn

polylearn

A library for factorization machines and polynomial networks for classification and regression in Python.
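For context, a degree-2 factorization machine (following Rendle's formulation) models the output as a linear part plus factorized pairwise interactions, with one low-rank vector v_i per feature:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j$$

The factorization keeps the number of interaction parameters linear in n, which is what makes these models practical on large sparse data.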

tick

tick implements a few models. Some of its authors went through ENSAE.

catboost

catboost is still difficult to install under Python. Like XGBoost, the library builds gradient-boosted trees, but it is specialized for categorical variables. Its benchmark shows it does better in that case.