Plots in machine learning - solutions


Solutions (work in progress) to the exercises on common plots in machine learning.

%matplotlib inline
%load_ext pyensae
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from jyquickhelper import add_notebook_menu
add_notebook_menu()

This module uses data from the Wine Quality Data Set, where the goal is to predict the quality of a wine from its chemical characteristics.

from pyensae.datasource import download_data, DownloadDataException
uci = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/"
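try:
    # Download both files from the UCI repository.
    download_data("winequality-red.csv", url=uci)
    download_data("winequality-white.csv", url=uci)
except DownloadDataException:
    # If the UCI download fails, fall back to the alternate source selected by website="xd".
    print("backup")
    download_data("winequality-red.csv", website="xd")
    download_data("winequality-white.csv", website="xd")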
%head winequality-red.csv
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10;7
7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5;7
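
The header shown above is quoted and the fields are separated by semicolons rather than commas. As a minimal sketch of what that implies for parsing, the snippet below reads a short excerpt (two rows and four of the twelve columns, copied from the output above; the variable names `sample`, `parsed` and `unsplit` are illustrative) with and without `sep=";"`.

```python
import io
import pandas

# Excerpt of winequality-red.csv: first two rows, only four columns kept.
sample = (
    '"fixed acidity";"volatile acidity";"alcohol";"quality"\n'
    '7.4;0.7;9.4;5\n'
    '7.8;0.88;9.8;5\n'
)

# With sep=";" each field becomes its own column.
parsed = pandas.read_csv(io.StringIO(sample), sep=";")
print(parsed.shape)
print(list(parsed.columns))

# With the default comma separator each line is read as a single column.
unsplit = pandas.read_csv(io.StringIO(sample))
print(unsplit.shape)
```

Checking the shape right after loading is a cheap way to catch a wrong separator before any plotting or modelling.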

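import pandas
# The files are semicolon-separated.
red_wine = pandas.read_csv("winequality-red.csv", sep=";")
# Flag column: 1 for red wines, 0 for white wines.
red_wine["red"] = 1
white_wine = pandas.read_csv("winequality-white.csv", sep=";")
white_wine["red"] = 0
# Stack both tables into a single DataFrame.
wines = pandas.concat([red_wine, white_wine])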
wines.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  red
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5    1
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5    1
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5    1
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6    1
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5    1
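
The two tables are flagged with a `red` column and then stacked. The sketch below reproduces that step on toy data to show one side effect: a plain `pandas.concat` keeps each table's own row labels, so labels repeat in the result. The toy values and the names `red_toy`, `white_toy`, `stacked` and `renumbered` are made up for illustration and are not taken from the real files.

```python
import pandas

# Toy rows (made-up values, not taken from the real files).
red_toy = pandas.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 6]})
white_toy = pandas.DataFrame({"alcohol": [8.8, 10.1, 9.5], "quality": [6, 6, 7]})

# Flag each table before stacking so the origin of a row is not lost.
red_toy["red"] = 1
white_toy["red"] = 0

# Plain concat keeps each table's own row labels.
stacked = pandas.concat([red_toy, white_toy])
print(stacked.index.tolist())      # [0, 1, 0, 1, 2]

# ignore_index=True renumbers the rows from 0 to n-1 instead.
renumbered = pandas.concat([red_toy, white_toy], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]
```

Repeated labels are harmless for the train/test split used here, but they matter as soon as rows are selected by label with `.loc`, in which case `ignore_index=True` is one option to consider.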

We split the data into a training set and a test set:

from sklearn.model_selection import train_test_split
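# Features: every column except the target.
X = wines[[c for c in wines.columns if c != "quality"]]
# Target: the quality score.
Y = wines["quality"]
# Hold out 33% of the rows for testing; random_state makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)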
type(x_train), type(y_train)
(pandas.core.frame.DataFrame, pandas.core.series.Series)
wines.shape, x_train.shape, y_train.shape
((6497, 13), (4352, 12), (4352,))
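
The shapes above can be checked by hand. The sketch below assumes that, when only `test_size` is given as a fraction, the number of test rows is rounded up and the training set receives the remainder; this assumption is consistent with the 4352 training rows reported above. The names `n_samples`, `n_test` and `n_train` are illustrative.

```python
import math

n_samples = 6497   # rows in wines, from wines.shape above
test_size = 0.33   # fraction passed to train_test_split

# Assumption: the test count is rounded up, the training set takes the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 4352 2145
```

The feature matrix has 12 columns rather than 13 because `quality` was removed from `X` and kept as the target.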