Graphes en machine learning - correction

Links: notebook, html ., PDF, python, slides ., presentation ., GitHub

Correction (en cours de rédaction) des exercices autour des graphes courants en machine learning.

%pylab inline
Populating the interactive namespace from numpy and matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from jyquickhelper import add_notebook_menu
add_notebook_menu()
Plan
run previous cell, wait for 2 seconds

Le module utilise des données issue de Wine Quality Data Set pour lequel on essaye de prédire la qualité du vin en fonction de ses caractéristiques chimiques.

import pyensae
from pyensae.datasource import DownloadDataException
uci = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/"
try:
    pyensae.download_data("winequality-red.csv", url=uci)
    pyensae.download_data("winequality-white.csv", url=uci)
except DownloadDataException:
    print("backup")
    pyensae.download_data("winequality-red.csv", website="xd")
    pyensae.download_data("winequality-white.csv", website="xd")
%head winequality-red.csv
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10;7
7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5;7

import pandas
red_wine = pandas.read_csv("winequality-red.csv", sep=";")
red_wine["red"] = 1
white_wine = pandas.read_csv("winequality-white.csv", sep=";")
white_wine["red"] = 0
wines = pandas.concat([red_wine, white_wine])
wines.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality red
0 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5 1
1 7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5 1
2 7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 9.8 5 1
3 11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 9.8 6 1
4 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5 1

On découpe en base d’apprentissage, base de test :

from sklearn.cross_validation import train_test_split
X = wines[[c for c in wines.columns if c != "quality"]]
Y = wines["quality"]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
type(x_train), type(y_train)
(pandas.core.frame.DataFrame, pandas.core.series.Series)
wines.shape, x_train.shape, y_train.shape
((6497, 13), (4352, 12), (4352,))

Exercice 1 : créer une fonction pour automatiser la création de ce graphe

Exercice 2 : simplifier l’apprentissage de chaque modèle