Decision Tree and Logistic Regression


This notebook demonstrates the model DecisionTreeLogisticRegression, which replaces the decision based on one variable (a threshold on a single feature) with a logistic regression at every node of the tree.
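The idea can be sketched in a few lines. The class below is a hypothetical illustration written for this text, not the mlinsights implementation, and it assumes binary labels.

import numpy
from sklearn.linear_model import LogisticRegression


class LogRegNodeSketch:
    "Hypothetical tree node splitting with a logistic regression instead of a threshold."

    def __init__(self, depth=0, max_depth=2):
        self.depth = depth
        self.max_depth = max_depth
        self.model = None
        self.children = None
        self.constant = None

    def fit(self, X, y):
        classes = numpy.unique(y)
        if len(classes) == 1:
            # Pure node: nothing left to split.
            self.constant = classes[0]
            return self
        self.model = LogisticRegression().fit(X, y)
        if self.depth < self.max_depth:
            # Route the samples according to the predicted probability.
            side = self.model.predict_proba(X)[:, 1] >= 0.5
            if 0 < side.sum() < len(y):
                self.children = (
                    LogRegNodeSketch(self.depth + 1, self.max_depth).fit(X[~side], y[~side]),
                    LogRegNodeSketch(self.depth + 1, self.max_depth).fit(X[side], y[side]))
        return self

    def predict(self, X):
        if self.constant is not None:
            return numpy.full(X.shape[0], self.constant)
        if self.children is None:
            return self.model.predict(X)
        side = self.model.predict_proba(X)[:, 1] >= 0.5
        pred = numpy.empty(X.shape[0], dtype=int)
        if (~side).any():
            pred[~side] = self.children[0].predict(X[~side])
        if side.any():
            pred[side] = self.children[1].predict(X[side])
        return pred

Once training data is available, LogRegNodeSketch(max_depth=2).fit(X_train, y_train) builds such a tree. mlinsights adds several strategies on top of this idea, chosen by the fit_improve_algo parameter, which, as the verbose output further below suggests, adjust the intercept of each regression.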

from jyquickhelper import add_notebook_menu
add_notebook_menu()
%matplotlib inline
import warnings
warnings.simplefilter("ignore")

Iris dataset and logistic regression

The following code draws the decision boundaries produced by two machine learning models on the Iris dataset.

import numpy
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def plot_classifier_decision_zone(clf, X, y, title=None, ax=None):
    "Draws the decision zones of a fitted classifier on a 2D dataset."
    if ax is None:
        ax = plt.gca()

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    dhx = (x_max - x_min) / 100
    dhy = (y_max - y_min) / 100
    xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, dhx),
                            numpy.arange(y_min, y_max, dhy))

    # Predict the class of every point of the grid.
    Z = clf.predict(numpy.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.5)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k', lw=0.5)
    if title is not None:
        ax.set_title(title)


iris = load_iris()
X = iris.data[:, [0, 2]]
y = iris.target
y = y % 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, shuffle=True)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

lr = LogisticRegression()
lr.fit(X_train, y_train)

dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train, y_train)

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
plot_classifier_decision_zone(lr, X_test, y_test, ax=ax[0], title="LogisticRegression")
plot_classifier_decision_zone(dt, X_test, y_test, ax=ax[1], title="DecisionTreeClassifier")
../_images/decision_tree_logreg_6_0.png

The logistic regression is not very stable on this kind of problem: after merging classes 0 and 2 with y % 2, no single linear separator can split the two classes. Let's dig into it.
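To see why, one can look at how the merged class spreads along the second selected feature, the petal length. This small check is an addition to the original notebook:

# After the relabelling y % 2, class 0 gathers setosa and virginica, which
# lie roughly on either side of versicolor (class 1) along the petal length
# axis, so no single straight line can separate the two classes.
for label in (0, 1):
    sub = X[y == label]
    print("class %d: petal length in [%1.1f, %1.1f]" % (
        label, sub[:, 1].min(), sub[:, 1].max()))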

DecisionTreeLogisticRegression

from mlinsights.mlmodel import DecisionTreeLogisticRegression

dtlr = DecisionTreeLogisticRegression(
    estimator=LogisticRegression(solver='liblinear'),
    min_samples_leaf=10, min_samples_split=10, max_depth=1,
    fit_improve_algo='none')
dtlr.fit(X_train, y_train)
dtlr2 = DecisionTreeLogisticRegression(
    estimator=LogisticRegression(solver='liblinear'),
    min_samples_leaf=4, min_samples_split=4, max_depth=10,
    fit_improve_algo='intercept_sort_always')
dtlr2.fit(X_train, y_train)

fig, ax = plt.subplots(2, 2, figsize=(10, 8))
plot_classifier_decision_zone(
    dtlr, X_train, y_train, ax=ax[0, 0],
    title="DecisionTreeLogisticRegression\ndepth=%d - train" % dtlr.tree_depth_)
plot_classifier_decision_zone(
    dtlr2, X_train, y_train, ax=ax[0, 1],
    title="DecisionTreeLogisticRegression\ndepth=%d - train" % dtlr2.tree_depth_)
plot_classifier_decision_zone(
    dtlr, X_test, y_test, ax=ax[1, 0],
    title="DecisionTreeLogisticRegression\ndepth=%d - test" % dtlr.tree_depth_)
plot_classifier_decision_zone(
    dtlr2, X_test, y_test, ax=ax[1, 1],
    title="DecisionTreeLogisticRegression\ndepth=%d - test" % dtlr2.tree_depth_)
../_images/decision_tree_logreg_9_0.png
from pandas import DataFrame

rows = []
for model in [lr, dt, dtlr, dtlr2]:
    val = (" - depth=%d" % model.tree_depth_) if hasattr(model, 'tree_depth_') else ""
    obs = dict(name="%s%s" % (model.__class__.__name__, val),
               score=model.score(X_test, y_test))
    rows.append(obs)

DataFrame(rows)
                                        name     score
0                         LogisticRegression  0.644444
1                     DecisionTreeClassifier  0.933333
2   DecisionTreeLogisticRegression - depth=1  0.700000
3   DecisionTreeLogisticRegression - depth=5  0.855556

A first example

import numpy
from scipy.spatial.distance import cdist


def random_set_simple(n):
    # Points drawn in the unit square, labelled 1 inside the quarter circle
    # of radius 1 centered at the origin, 0 outside.
    X = numpy.random.rand(n, 2)
    y = ((X[:, 0] ** 2 + X[:, 1] ** 2) <= 1).astype(numpy.int32).ravel()
    return X, y

X, y = random_set_simple(2000)
X_train, X_test, y_train, y_test = train_test_split(X, y)
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)
dt8 = DecisionTreeClassifier(max_depth=10)
dt8.fit(X_train, y_train)

fig, ax = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
plot_classifier_decision_zone(dt, X_test, y_test, ax=ax[0],
                              title="DecisionTree - max_depth=%d\nacc=%1.2f" % (
                                  dt.max_depth, dt.score(X_test, y_test)))
plot_classifier_decision_zone(dt8, X_test, y_test, ax=ax[1],
                              title="DecisionTree - max_depth=%d\nacc=%1.2f" % (
                                  dt8.max_depth, dt8.score(X_test, y_test)))
ax[0].set_xlim([0, 1])
ax[1].set_xlim([0, 1])
ax[0].set_ylim([0, 1]);
../_images/decision_tree_logreg_12_0.png
dtlr = DecisionTreeLogisticRegression(
    max_depth=3, fit_improve_algo='intercept_sort_always', verbose=1)
dtlr.fit(X_train, y_train)
dtlr8 = DecisionTreeLogisticRegression(
    max_depth=10, min_samples_split=4, fit_improve_algo='intercept_sort_always')
dtlr8.fit(X_train, y_train)

fig, ax = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
plot_classifier_decision_zone(dtlr, X_test, y_test, ax=ax[0],
                              title="DecisionTreeLogReg - depth=%d\nacc=%1.2f" % (
                                  dtlr.tree_depth_, dtlr.score(X_test, y_test)))
plot_classifier_decision_zone(dtlr8, X_test, y_test, ax=ax[1],
                              title="DecisionTreeLogReg - depth=%d\nacc=%1.2f" % (
                                  dtlr8.tree_depth_, dtlr8.score(X_test, y_test)))
ax[0].set_xlim([0, 1])
ax[1].set_xlim([0, 1])
ax[0].set_ylim([0, 1]);
[DTLR ]   trained acc 0.96 N=1500
[DTLRI]   change intercept 11.677031 --> 10.877451 in [0.278070, 16.549686]
[DTLR*]  above: n_class=2 N=1500 - 1106/1500
[DTLR ]    trained acc 0.99 N=1106
[DTLRI]    change intercept 6.021739 --> 1.840312 in [0.063825, 2.640076]
[DTLR*]   above: n_class=1 N=1106 - 743/1500
[DTLR*]   below: n_class=2 N=1106 - 363/1500
[DTLR ]     trained acc 0.96 N=363
[DTLRI]     change intercept 3.970377 --> 0.770538 in [0.461779, 0.985259]
[DTLR*]  below: n_class=2 N=1500 - 394/1500
[DTLR ]    trained acc 0.80 N=394
[DTLRI]    change intercept 4.763873 --> 5.983343 in [5.225083, 8.055335]
[DTLR*]   above: n_class=2 N=394 - 162/1500
[DTLR ]     trained acc 0.54 N=162
[DTLRI]     change intercept 1.289949 --> 1.351619 in [1.036507, 1.533679]
[DTLR*]   below: n_class=1 N=394 - 232/1500
../_images/decision_tree_logreg_13_1.png
from mlinsights.mltree import predict_leaves


def draw_border(clr, X, y, fct=None, incx=0.1, incy=0.1,
                figsize=None, border=True, ax=None,
                s=10., linewidths=0.1):
    "Colors the plane with the model predictions (or *fct*) and overlays the data points."
    h = .02
    x_min, x_max = X[:, 0].min() - incx, X[:, 0].max() + incx
    y_min, y_max = X[:, 1].min() - incy, X[:, 1].max() + incy
    xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, h),
                            numpy.arange(y_min, y_max, h))
    if fct is None:
        Z = clr.predict(numpy.c_[xx.ravel(), yy.ravel()])
    else:
        Z = fct(clr, numpy.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot.
    cmap = plt.cm.tab20
    Z = Z.reshape(xx.shape)
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=figsize or (4, 3))
    ax.pcolormesh(xx, yy, Z, cmap=cmap)

    # Plot also the data points.
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k',
               cmap=cmap, s=s, linewidths=linewidths)

    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    return ax


fig, ax = plt.subplots(1, 2, figsize=(14, 4))
draw_border(dt, X_test, y_test, border=False, ax=ax[0])
ax[0].set_title("DecisionTree - predictions")
draw_border(dt, X, y, border=False, ax=ax[1],
            fct=lambda m, x: predict_leaves(m, x))
ax[1].set_title("DecisionTree - leaf zones");
../_images/decision_tree_logreg_14_0.png
from tqdm import tqdm

fig, ax = plt.subplots(6, 4, figsize=(12, 16))
for i, depth in tqdm(enumerate((1, 2, 3, 4, 5, 6))):
    dtl = DecisionTreeLogisticRegression(
        max_depth=depth, fit_improve_algo='intercept_sort_always',
        min_samples_leaf=2)
    dtl.fit(X_train, y_train)
    draw_border(dtl, X_test, y_test, border=False, ax=ax[i, 0], s=4.)
    draw_border(dtl, X, y, border=False, ax=ax[i, 1],
                fct=lambda m, x: predict_leaves(m, x), s=4.)
    ax[i, 0].set_title("Depth=%d nodes=%d score=%1.2f" % (
        dtl.tree_depth_, dtl.n_nodes_, dtl.score(X_test, y_test)))
    ax[i, 1].set_title("DTLR Leaves zones");

    dtl = DecisionTreeClassifier(max_depth=depth)
    dtl.fit(X_train, y_train)
    draw_border(dtl, X_test, y_test, border=False, ax=ax[i, 2], s=4.)
    draw_border(dtl, X, y, border=False, ax=ax[i, 3],
                fct=lambda m, x: predict_leaves(m, x), s=4.)
    ax[i, 2].set_title("Depth=%d nodes=%d score=%1.2f" % (
        dtl.max_depth, dtl.tree_.node_count, dtl.score(X_test, y_test)))
    ax[i, 3].set_title("DT Leaves zones");

    for k in range(ax.shape[1]):
        ax[i, k].get_xaxis().set_visible(False)
6it [00:02,  2.92it/s]
../_images/decision_tree_logreg_15_1.png

Another example designed to fail

This dataset is designed to be difficult for a regular decision tree.

from scipy.spatial.distance import cdist

def random_set(n):
    # Points in the unit square, labelled 1 inside the L1 ball (a diamond)
    # of radius 0.5 centered at (0.5, 0.5), 0 outside.
    X = numpy.random.rand(n, 2)
    y = (cdist(X, numpy.array([[0.5, 0.5]]),
               metric='minkowski', p=1) <= 0.5).astype(numpy.int32).ravel()
    return X, y

X, y = random_set(2000)
X_train, X_test, y_train, y_test = train_test_split(X, y)
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)
dt8 = DecisionTreeClassifier(max_depth=10)
dt8.fit(X_train, y_train)

fig, ax = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
plot_classifier_decision_zone(dt, X_test, y_test, ax=ax[0],
                              title="DecisionTree - max_depth=%d\nacc=%1.2f" % (
                                  dt.max_depth, dt.score(X_test, y_test)))
plot_classifier_decision_zone(dt8, X_test, y_test, ax=ax[1],
                              title="DecisionTree - max_depth=%d\nacc=%1.2f" % (
                                  dt8.max_depth, dt8.score(X_test, y_test)))
ax[0].set_xlim([0, 1])
ax[1].set_xlim([0, 1])
ax[0].set_ylim([0, 1]);
../_images/decision_tree_logreg_17_0.png

The positive class fills a square rotated by 45 degrees: every sample inside the square is positive, every sample outside is negative. A regular tree can only approximate the oblique border with horizontal and vertical cuts.
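One way to see why oblique splits help: rotating the features by 45 degrees turns the diamond into an axis-aligned square, which a shallow regular tree handles well. The following check is an addition to the original notebook; the exact score depends on the random draw.

# Rotate the features by 45 degrees: the L1 ball |x-0.5| + |y-0.5| <= 0.5
# becomes an axis-aligned square, so a regular tree can place its cuts
# along the border. A per-node logistic regression learns such oblique
# cuts directly, without rotating anything by hand.
rot = numpy.array([[1., 1.], [-1., 1.]]) / numpy.sqrt(2)
dt_rot = DecisionTreeClassifier(max_depth=4)
dt_rot.fit(X_train @ rot, y_train)
print("accuracy on rotated features:", dt_rot.score(X_test @ rot, y_test))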

dtlr = DecisionTreeLogisticRegression(
    max_depth=3, fit_improve_algo='intercept_sort_always', verbose=1)
dtlr.fit(X_train, y_train)
dtlr8 = DecisionTreeLogisticRegression(
    max_depth=10, min_samples_split=4, fit_improve_algo='intercept_sort_always')
dtlr8.fit(X_train, y_train)

fig, ax = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
plot_classifier_decision_zone(dtlr, X_test, y_test, ax=ax[0],
                              title="DecisionTreeLogReg - depth=%d\nacc=%1.2f" % (
                                  dtlr.tree_depth_, dtlr.score(X_test, y_test)))
plot_classifier_decision_zone(dtlr8, X_test, y_test, ax=ax[1],
                              title="DecisionTreeLogReg - depth=%d\nacc=%1.2f" % (
                                  dtlr8.tree_depth_, dtlr8.score(X_test, y_test)))
ax[0].set_xlim([0, 1])
ax[1].set_xlim([0, 1])
ax[0].set_ylim([0, 1]);
[DTLR ]   trained acc 0.50 N=1500
[DTLRI]   change intercept 0.001126 --> 0.019908 in [0.001172, 0.038195]
[DTLR*]  above: n_class=2 N=1500 - 749/1500
[DTLR ]    trained acc 0.64 N=749
[DTLRI]    change intercept -1.972404 --> -2.003562 in [-3.382932, -0.149398]
[DTLR*]   above: n_class=2 N=749 - 377/1500
[DTLR ]     trained acc 0.64 N=377
[DTLRI]     change intercept 1.136431 --> 0.564497 in [0.399068, 0.831867]
[DTLR*]   below: n_class=2 N=749 - 372/1500
[DTLR ]     trained acc 0.77 N=372
[DTLRI]     change intercept -2.481437 --> -1.962176 in [-3.275774, -0.156925]
[DTLR*]  below: n_class=2 N=1500 - 751/1500
[DTLR ]    trained acc 0.66 N=751
[DTLRI]    change intercept 4.143107 --> 4.117942 in [2.662598, 6.063896]
[DTLR*]   above: n_class=2 N=751 - 388/1500
[DTLR ]     trained acc 0.64 N=388
[DTLRI]     change intercept -0.412468 --> -0.999464 in [-1.346126, -0.659144]
[DTLR*]   below: n_class=2 N=751 - 363/1500
[DTLR ]     trained acc 0.75 N=363
[DTLRI]     change intercept 5.485085 --> 6.009627 in [5.307328, 7.827812]
../_images/decision_tree_logreg_19_1.png

Leaf zones

We use the function predict_leaves from mlinsights.mltree to understand which leaf is responsible for which zone.
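As a quick illustration (an addition to the original notebook, assuming predict_leaves returns one leaf index per sample, as its use in draw_border suggests):

leaves = predict_leaves(dtlr, X_test)
# Each distinct index corresponds to one colored zone in the figures below.
print("leaves reached on the test set:", len(set(leaves)))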

fig, ax = plt.subplots(1, 2, figsize=(14, 4))
draw_border(dtlr, X_test, y_test, border=False, ax=ax[0])
ax[0].set_title("DecisionTreeLogisticRegression - predictions")
draw_border(dtlr, X, y, border=False, ax=ax[1],
            fct=lambda m, x: predict_leaves(m, x))
ax[1].set_title("DecisionTreeLogisticRegression - leaf zones");
../_images/decision_tree_logreg_21_0.png
from tqdm import tqdm

fig, ax = plt.subplots(6, 4, figsize=(12, 16))
for i, depth in tqdm(enumerate((1, 2, 3, 4, 5, 6))):
    dtl = DecisionTreeLogisticRegression(
        max_depth=depth, fit_improve_algo='intercept_sort_always',
        min_samples_leaf=2)
    dtl.fit(X_train, y_train)
    draw_border(dtl, X_test, y_test, border=False, ax=ax[i, 0], s=4.)
    draw_border(dtl, X, y, border=False, ax=ax[i, 1],
                fct=lambda m, x: predict_leaves(m, x), s=4.)
    ax[i, 0].set_title("Depth=%d nodes=%d score=%1.2f" % (
        dtl.tree_depth_, dtl.n_nodes_, dtl.score(X_test, y_test)))
    ax[i, 1].set_title("DTLR Leaves zones");

    dtl = DecisionTreeClassifier(max_depth=depth)
    dtl.fit(X_train, y_train)
    draw_border(dtl, X_test, y_test, border=False, ax=ax[i, 2], s=4.)
    draw_border(dtl, X, y, border=False, ax=ax[i, 3],
                fct=lambda m, x: predict_leaves(m, x), s=4.)
    ax[i, 2].set_title("Depth=%d nodes=%d score=%1.2f" % (
        dtl.max_depth, dtl.tree_.node_count, dtl.score(X_test, y_test)))
    ax[i, 3].set_title("DT Leaves zones");
6it [00:02,  2.29it/s]
../_images/decision_tree_logreg_22_1.png