from sklearn.datasets import load_diabetes
data = load_diabetes()
print(data.data.shape)
print(data.target.shape)

(442, 10)
(442,)


print(data.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
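
The note in the description above says each feature column is mean centered and scaled so that its sum of squares totals 1. A minimal sketch for checking that claim numerically follows; the variable names `col_means` and `col_sum_sq` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
X = data.data

# Each column should have a mean of (approximately) zero...
col_means = X.mean(axis=0)
# ...and a sum of squares of (approximately) one.
col_sum_sq = (X ** 2).sum(axis=0)

print("max |mean|:", np.abs(col_means).max())
print("sum of squares:", np.round(col_sum_sq, 3))
```

Because of this scaling, the feature values are not in their original units (years, kg/m², and so on), which is worth keeping in mind when reading the scatter plots.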


%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np


plt.hist(data.target)
plt.xlabel('Progression')
plt.ylabel('Count');


for index, feature_name in enumerate(data.feature_names):
    plt.figure()
    plt.scatter(data.data[:, index], data.target)
    plt.ylabel('Progression')
    plt.xlabel(feature_name)
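
The per-feature scatter plots above can be complemented with a single number per feature. The sketch below computes the Pearson correlation between each column and the target with `np.corrcoef`; the `correlations` dictionary is a name chosen for this example, and a linear correlation is only a rough summary of what the plots show.

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()

# Pearson correlation between each feature column and the target.
correlations = {
    name: np.corrcoef(data.data[:, i], data.target)[0, 1]
    for i, name in enumerate(data.feature_names)
}

# Print features from strongest to weakest absolute correlation.
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>4}: {r:+.3f}")
```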


from sklearn.model_selection import train_test_split

# Randomly hold out part of the data (25% by default) for testing
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
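
The split above is random, so the RMS values printed later in this notebook will change from run to run. A minimal sketch of a repeatable split follows, assuming a fixed seed is acceptable for your purposes; the value `random_state=0` is an arbitrary choice made for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

data = load_diabetes()

# random_state=0 is an arbitrary seed chosen for this sketch; any fixed
# integer gives the same split on every run.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

print(X_train.shape, X_test.shape)
```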


from sklearn.linear_model import LinearRegression

clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression()


predicted = clf.predict(X_test)
expected = y_test


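plt.scatter(expected, predicted)
# Dashed diagonal marks perfect predictions (predicted == true)
plt.plot([0, 350], [0, 350], '--k')
plt.axis('tight')
plt.xlabel('True Progression')
plt.ylabel('Predicted Progression')
# Root-mean-square error on the held-out test set
print("RMS:", np.sqrt(np.mean((predicted - expected) ** 2)))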

RMS: 46.2788680281883
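
The RMS figure printed above is the square root of the mean squared error. As a sketch, the same quantity can be obtained from `sklearn.metrics.mean_squared_error`. The arrays below are small made-up values used only to show that the two calculations agree, and the `_demo` names are chosen here so they do not overwrite the notebook's own `expected` and `predicted`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Small made-up arrays, used only to illustrate the calculation.
expected_demo = np.array([3.0, -0.5, 2.0, 7.0])
predicted_demo = np.array([2.5, 0.0, 2.0, 8.0])

# Manual formula, as used in the notebook.
rms_manual = np.sqrt(np.mean((predicted_demo - expected_demo) ** 2))
# Same quantity via scikit-learn's mean squared error.
rms_sklearn = np.sqrt(mean_squared_error(expected_demo, predicted_demo))

print("manual: ", rms_manual)
print("sklearn:", rms_sklearn)
```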


from sklearn.ensemble import GradientBoostingRegressor
# Instantiate the model, fit it to the training data, and scatter true vs. predicted values


from sklearn.ensemble import GradientBoostingRegressor

clf = GradientBoostingRegressor()
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)
expected = y_test

plt.scatter(expected, predicted)
plt.plot([0, 350], [0, 350], '--k')
plt.axis('tight')
plt.xlabel('True Progression')
plt.ylabel('Predicted Progression')
print("RMS:", np.sqrt(np.mean((predicted - expected) ** 2)))

RMS: 50.35162346769333
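
The two RMS values in this notebook (about 46.3 for linear regression and 50.4 for gradient boosting) come from a single unseeded split, so the gap between them may not be stable. The sketch below evaluates both models on the same fixed split; the seed `random_state=0` and the `models` and `rms_by_model` names are assumptions made for this example, and the numbers it prints will generally differ from those above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_diabetes()

# Fixed seed (arbitrary) so both models see exactly the same split.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

rms_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    # Root-mean-square error on the held-out test set.
    rms_by_model[name] = np.sqrt(np.mean((predicted - y_test) ** 2))
    print(f"{name}: RMS = {rms_by_model[name]:.2f}")
```

A single split is still a noisy estimate; repeating this over several seeds gives a better sense of how the two models compare.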

2A.ML101.4: Supervised Learning: Regression

Predicting Progression: a Simple Linear Regression

Exercise: Gradient Boosting Tree Regression

Solution: