.. _06unsuperviseddimreductionrst:

==============================================================================
2A.ML101.6: Unsupervised Learning: Dimensionality Reduction and Visualization
==============================================================================

.. only:: html

    **Links:** :download:`notebook <06_unsupervised_dimreduction.ipynb>`,
    :downloadlink:`html <06_unsupervised_dimreduction2html.html>`,
    :download:`python <06_unsupervised_dimreduction.py>`,
    :downloadlink:`slides <06_unsupervised_dimreduction.slides.html>`,
    :githublink:`GitHub|_doc/notebooks/sklearn_ensae_course/06_unsupervised_dimreduction.ipynb|*`

Unsupervised learning deals with situations in which the data X is
available but the labels y are not. A typical use case is to find hidden
structure in the data.

*Source:* "Course on machine learning with scikit-learn" by Gaël Varoquaux

Dimensionality Reduction: PCA
-----------------------------

Dimensionality reduction is the task of deriving a set of new artificial
features that is smaller than the original feature set while retaining most
of the variance of the original data. Here we'll use a common but powerful
dimensionality reduction technique called Principal Component Analysis
(PCA). We'll perform PCA on the iris dataset that we saw before:

.. code:: ipython3

    from sklearn.datasets import load_iris
    iris = load_iris()
    X = iris.data
    y = iris.target

PCA builds linear combinations of the original features, computed with a
truncated Singular Value Decomposition of the matrix X, so as to project
the data onto a basis of the top singular vectors. If the number of
retained components is 2 or 3, PCA can be used to visualize the dataset.

.. code:: ipython3

    from sklearn.decomposition import PCA
    pca = PCA(n_components=2, whiten=True)
    pca.fit(X)

.. parsed-literal::

    PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
        svd_solver='auto', tol=0.0, whiten=True)

Once fitted, the pca model exposes the singular vectors in the
``components_`` attribute:

.. code:: ipython3

    pca.components_

.. parsed-literal::

    array([[ 0.36158968, -0.08226889,  0.85657211,  0.35884393],
           [ 0.65653988,  0.72971237, -0.1757674 , -0.07470647]])

Other attributes are available as well:

.. code:: ipython3

    pca.explained_variance_ratio_

.. parsed-literal::

    array([0.92461621, 0.05301557])

.. code:: ipython3

    pca.explained_variance_ratio_.sum()

.. parsed-literal::

    0.9776317750248034

Let us project the iris dataset along those first two dimensions:

.. code:: ipython3

    X_pca = pca.transform(X)

PCA ``normalizes`` and ``whitens`` the data, which means that the data is
now centered on both components with unit variance:

.. code:: ipython3

    X_pca.mean(axis=0)

.. parsed-literal::

    array([-1.30044124e-15, -1.69790108e-15])

.. code:: ipython3

    X_pca.std(axis=0)

.. parsed-literal::

    array([0.99666109, 0.99666109])

Furthermore, the projected components no longer carry any linear
correlation:

.. code:: ipython3

    import numpy as np
    np.corrcoef(X_pca.T)

.. parsed-literal::

    array([[1.00000000e+00, 6.51918477e-16],
           [6.51918477e-16, 1.00000000e+00]])

We can visualize the projection using matplotlib:

.. code:: ipython3

    %matplotlib inline
    import matplotlib.pyplot as plt
    target_ids = range(len(iris.target_names))
    plt.figure()
    for i, c, label in zip(target_ids, 'rgbcmykw', iris.target_names):
        plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], c=c, label=label)
    plt.legend();

.. image:: 06_unsupervised_dimreduction_20_0.png

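
As computed above with ``explained_variance_ratio_``, the first two
components already capture close to 98% of the variance. As an optional
aside (not part of the original notebook), here is a minimal sketch,
assuming the iris matrix ``X`` loaded above is still in memory, of how one
might check this more generally by fitting a full PCA and plotting the
cumulative explained variance:

.. code:: ipython3

    # Optional sketch: how much variance do we keep as we add components?
    # Fit a PCA with all components on the same iris data and plot the
    # cumulative explained variance ratio.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    pca_full = PCA().fit(X)  # X is the iris data loaded above
    cumulative = np.cumsum(pca_full.explained_variance_ratio_)
    plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
    plt.xlabel("number of components")
    plt.ylabel("cumulative explained variance ratio");
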

Note that the PCA projection above was determined *without* any information
about the labels (represented by the colors): this is the sense in which
the learning is **unsupervised**. Nevertheless, we see that the projection
gives us insight into the distribution of the different flowers in
parameter space: notably, *iris setosa* is much more distinct than the
other two species.

Note also that the default implementation of PCA computes the singular
value decomposition (SVD) of the full data matrix, which does not scale
well when both ``n_samples`` and ``n_features`` are large (more than a few
thousand). If you are interested in a number of components that is much
smaller than both ``n_samples`` and ``n_features``, consider using
``PCA(svd_solver='randomized')`` instead (this replaces the older
``sklearn.decomposition.RandomizedPCA``).

Manifold Learning
-----------------

One weakness of PCA is that it cannot detect non-linear features. A family
of algorithms known as *Manifold Learning* has been developed to address
this deficiency. A canonical dataset used in manifold learning is the
*S-curve*, which we briefly saw in an earlier section:

.. code:: ipython3

    from sklearn.datasets import make_s_curve
    X, y = make_s_curve(n_samples=1000)

    from mpl_toolkits.mplot3d import Axes3D
    ax = plt.axes(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], c=y)
    ax.view_init(10, -60)

.. image:: 06_unsupervised_dimreduction_25_0.png

This is a 2-dimensional dataset embedded in three dimensions, but it is
embedded in such a way that PCA cannot discover the underlying data
orientation:

.. code:: ipython3

    X_pca = PCA(n_components=2).fit_transform(X)
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y);

.. image:: 06_unsupervised_dimreduction_27_0.png

Manifold learning algorithms, available in the ``sklearn.manifold``
submodule, are however able to recover the underlying 2-dimensional
manifold:

.. code:: ipython3

    from sklearn.manifold import LocallyLinearEmbedding, Isomap
    lle = LocallyLinearEmbedding(n_neighbors=15, n_components=2,
                                 method='modified')
    X_lle = lle.fit_transform(X)
    plt.scatter(X_lle[:, 0], X_lle[:, 1], c=y);

.. image:: 06_unsupervised_dimreduction_29_0.png

.. code:: ipython3

    iso = Isomap(n_neighbors=15, n_components=2)
    X_iso = iso.fit_transform(X)
    plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y);

.. image:: 06_unsupervised_dimreduction_30_0.png

Exercise: Dimension reduction of digits
---------------------------------------

Apply PCA, LocallyLinearEmbedding, and Isomap to project the digits data to
two dimensions. Which visualization technique separates the classes most
cleanly?

.. code:: ipython3

    from sklearn.datasets import load_digits
    digits = load_digits()
    # ...

Solution:
~~~~~~~~~

.. code:: ipython3

    from sklearn.decomposition import PCA
    from sklearn.manifold import Isomap, LocallyLinearEmbedding

    plt.figure(figsize=(14, 4))
    for i, est in enumerate([PCA(n_components=2, whiten=True),
                             Isomap(n_components=2, n_neighbors=10),
                             LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                                                    method='modified')]):
        plt.subplot(131 + i)
        projection = est.fit_transform(digits.data)
        plt.scatter(projection[:, 0], projection[:, 1], c=digits.target)
        plt.title(est.__class__.__name__)

.. image:: 06_unsupervised_dimreduction_36_0.png

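
The exercise asks which projection separates the classes most cleanly.
Beyond visual inspection, one optional way to compare the three projections
is to score each 2-D embedding against the known digit labels. This is not
part of the original notebook; it is a minimal sketch that reuses the same
three estimators and ``sklearn.metrics.silhouette_score`` (higher values
mean better-separated clusters):

.. code:: ipython3

    # Optional follow-up (not in the original notebook): quantify how well
    # each 2D projection separates the digit classes with a silhouette score.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.manifold import Isomap, LocallyLinearEmbedding
    from sklearn.metrics import silhouette_score

    digits = load_digits()
    estimators = [PCA(n_components=2, whiten=True),
                  Isomap(n_components=2, n_neighbors=10),
                  LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                                         method='modified')]
    for est in estimators:
        projection = est.fit_transform(digits.data)
        score = silhouette_score(projection, digits.target)
        print(est.__class__.__name__, round(score, 3))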