{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Machine learning with categories and text\n", "\n", "The [Titanic](http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets) dataset lists the passengers of the Titanic along with a few details such as the ticket, the class, and whether they survived. Machine learning can be used to study the probability of survival, or rather to understand a little better who was lucky enough to survive. If survival can be predicted, then some kind of rule exists."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## The data\n", "\n", "The data can be downloaded manually or loaded with the function *load_titanic_dataset*. Either of the following two datasets can be used."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["
<table>\n", "<thead>\n", "<tr><th></th><th>pclass</th><th>survived</th><th>name</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>ticket</th><th>fare</th><th>cabin</th><th>embarked</th><th>boat</th><th>body</th><th>home.dest</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>1</td><td>Allen, Miss. Elisabeth Walton</td><td>female</td><td>29.00</td><td>0</td><td>0</td><td>24160</td><td>211.3375</td><td>B5</td><td>S</td><td>2</td><td>NaN</td><td>St Louis, MO</td></tr>\n", "<tr><th>1</th><td>1</td><td>1</td><td>Allison, Master. Hudson Trevor</td><td>male</td><td>0.92</td><td>1</td><td>2</td><td>113781</td><td>151.5500</td><td>C22 C26</td><td>S</td><td>11</td><td>NaN</td><td>Montreal, PQ / Chesterville, ON</td></tr>\n", "</tbody>\n", "</table>
"], "text/plain": [" pclass survived name sex age sibsp \\\n", "0 1 1 Allen, Miss. Elisabeth Walton female 29.00 0 \n", "1 1 1 Allison, Master. Hudson Trevor male 0.92 1 \n", "\n", " parch ticket fare cabin embarked boat body \\\n", "0 0 24160 211.3375 B5 S 2 NaN \n", "1 2 113781 151.5500 C22 C26 S 11 NaN \n", "\n", " home.dest \n", "0 St Louis, MO \n", "1 Montreal, PQ / Chesterville, ON "]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from papierstat.datasets import load_titanic_dataset\n", "data1 = load_titanic_dataset(subset=\"A\")\n", "data1.head(n=2)"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
<table>\n", "<thead>\n", "<tr><th></th><th>row.names</th><th>pclass</th><th>survived</th><th>name</th><th>age</th><th>embarked</th><th>home.dest</th><th>room</th><th>ticket</th><th>boat</th><th>sex</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>1st</td><td>1</td><td>Allen, Miss Elisabeth Walton</td><td>29.0</td><td>Southampton</td><td>St Louis, MO</td><td>B-5</td><td>24160 L221</td><td>2</td><td>female</td></tr>\n", "<tr><th>1</th><td>2</td><td>1st</td><td>0</td><td>Allison, Miss Helen Loraine</td><td>2.0</td><td>Southampton</td><td>Montreal, PQ / Chesterville, ON</td><td>C26</td><td>NaN</td><td>NaN</td><td>female</td></tr>\n", "</tbody>\n", "</table>
"], "text/plain": [" row.names pclass survived name age \\\n", "0 1 1st 1 Allen, Miss Elisabeth Walton 29.0 \n", "1 2 1st 0 Allison, Miss Helen Loraine 2.0 \n", "\n", " embarked home.dest room ticket boat sex \n", "0 Southampton St Louis, MO B-5 24160 L221 2 female \n", "1 Southampton Montreal, PQ / Chesterville, ON C26 NaN NaN female "]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["data2 = load_titanic_dataset(subset=\"B\")\n", "data2.head(n=2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Premier mod\u00e8le de pr\u00e9diction\n", "\n", "On veut savoir si la survie de chaque personne \u00e9tait plut\u00f4t al\u00e9atoire o\u00f9 certaines personnes ont \u00e9t\u00e9 privil\u00e9gi\u00e9es. Plut\u00f4t que de se lancer dans une \u00e9tude de statistique descriptive, on cale un mod\u00e8le de pr\u00e9diction. Si celui-ci fonctionne, cela signifie qu'il existe un lien entre la survie et certaines des informations connues sur chaque passager. On consid\u00e8re les variables ``age, sex, pclass``."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": ["df = data1"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "\n", "X = df[[\"age\", \"sex\", \"pclass\"]]\n", "y = df['survived']\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["({'female', 'male'}, {1, 2, 3})"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["set(df.sex), set(df.pclass)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il faut d'abord transformer les variables cat\u00e9gorielles en variables num\u00e9riques. 
This is the job of a [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html): each category becomes a binary variable."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,\n", " handle_unknown='error', sparse=True)"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.preprocessing import OneHotEncoder\n", "one = OneHotEncoder()\n", "one.fit(X_train[['sex', 'pclass']])"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["<981x5 sparse matrix of type '<class 'numpy.float64'>'\n", "\twith 1962 stored elements in Compressed Sparse Row format>"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["one.transform(X_train[['sex', 'pclass']])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The result is a sparse matrix, which stores only the nonzero values. 
To display the matrix, it has to be made dense."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[1., 0., 0., 1., 0.],\n", " [0., 1., 0., 0., 1.],\n", " [0., 1., 1., 0., 0.],\n", " ...,\n", " [1., 0., 0., 0., 1.],\n", " [0., 1., 0., 0., 1.],\n", " [0., 1., 0., 0., 1.]])"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["one.transform(X_train[['sex', 'pclass']]).todense()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["There are five columns, one per category, with the following names:"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["array(['x0_female', 'x0_male', 'x1_1', 'x1_2', 'x1_3'], dtype=object)"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["names = one.get_feature_names()\n", "names"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[31., 1., 0., 0., 1., 0.],\n", " [28., 0., 1., 0., 0., 1.],\n", " [21., 0., 1., 1., 0., 0.],\n", " [28., 0., 1., 0., 0., 1.],\n", " [nan, 0., 1., 0., 1., 0.],\n", " [80., 0., 1., 1., 0., 0.],\n", " [39., 0., 1., 1., 0., 0.],\n", " [12., 1., 0., 0., 1., 0.],\n", " [nan, 0., 1., 1., 0., 0.],\n", " [31., 1., 0., 0., 1., 0.]])"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "cats = one.transform(X_train[['sex', 'pclass']]).todense()\n", "age = X_train[['age']]\n", "feat = numpy.hstack([age, cats])\n", "feat[:10]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The column `age` contains missing values. 
We replace them with the mean."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["((1309, 1), (1046, 1))"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["df[['age']].shape, df[['age']].dropna().shape"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[31. , 1. , 0. , 0. , 1. ,\n", " 0. ],\n", " [28. , 0. , 1. , 0. , 0. ,\n", " 1. ],\n", " [21. , 0. , 1. , 1. , 0. ,\n", " 0. ],\n", " [28. , 0. , 1. , 0. , 0. ,\n", " 1. ],\n", " [29.60563707, 0. , 1. , 0. , 1. ,\n", " 0. ],\n", " [80. , 0. , 1. , 1. , 0. ,\n", " 0. ],\n", " [39. , 0. , 1. , 1. , 0. ,\n", " 0. ],\n", " [12. , 1. , 0. , 0. , 1. ,\n", " 0. ],\n", " [29.60563707, 0. , 1. , 1. , 0. ,\n", " 0. ],\n", " [31. , 1. , 0. , 0. , 1. ,\n", " 0. ]])"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.impute import SimpleImputer\n", "imp = SimpleImputer()\n", "new_age = imp.fit_transform(X_train[['age']])\n", "feat = numpy.hstack([new_age, cats])\n", "feat[:10]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We can now fit a random forest."]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", " criterion='gini', max_depth=None, max_features='auto',\n", " max_leaf_nodes=None, max_samples=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100,\n", " n_jobs=None, oob_score=False, random_state=None,\n", " verbose=0, warm_start=False)"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.ensemble import RandomForestClassifier\n", "rf = RandomForestClassifier(n_estimators=100)\n", "rf.fit(feat, y_train)"]}, 
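{"cell_type": "markdown", "metadata": {}, "source": ["As a quick sanity check of the two preprocessing steps used so far, here is a minimal sketch on a toy table. The frame `toy` and its values are made up for illustration and are not part of the Titanic data. The sketch only assumes the default behaviour of `OneHotEncoder` (one binary column per category) and of `SimpleImputer` (missing values replaced by the column mean)."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import numpy\n", "import pandas\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.impute import SimpleImputer\n", "\n", "# hypothetical toy data, not taken from the Titanic dataset\n", "toy = pandas.DataFrame({'age': [20., numpy.nan, 40.],\n", "                        'sex': ['female', 'male', 'male'],\n", "                        'pclass': [1, 3, 3]})\n", "# one binary column per category: female, male, 1, 3\n", "toy_cats = OneHotEncoder().fit_transform(toy[['sex', 'pclass']]).toarray()\n", "# the missing age is replaced by the mean of 20 and 40\n", "toy_age = SimpleImputer().fit_transform(toy[['age']])\n", "numpy.hstack([toy_age, toy_cats])"]}, 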
{"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": ["cats_test = one.transform(X_test[['sex', 'pclass']])\n", "age_test = imp.transform(X_test[['age']])\n", "feat_test = numpy.hstack([age_test, cats_test.todense()])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We look at the predictions on the test set and the model seems to work: in the run below, 254 of the 328 test passengers (about 77%) are classified correctly."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[177, 25],\n", " [ 49, 77]], dtype=int64)"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import confusion_matrix\n", "\n", "pred = rf.predict(feat_test)\n", "confusion_matrix(y_test, pred)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Survival is therefore not random. We put all of this in a pipeline because it is always easier to read."]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 0. , 1. , 0. , 1. , 0. ,\n", " 30. ],\n", " [ 0. , 1. , 0. , 0. , 1. ,\n", " 29.60563707],\n", " [ 0. , 1. , 0. , 0. , 1. ,\n", " 29.60563707],\n", " ...,\n", " [ 0. , 1. , 0. , 0. , 1. ,\n", " 25. ],\n", " [ 1. , 0. , 1. , 0. , 0. ,\n", " 53. ],\n", " [ 1. , 0. , 1. , 0. , 0. ,\n", " 17. 
]])"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.pipeline import Pipeline\n", "from sklearn.compose import ColumnTransformer\n", "\n", "pipe = Pipeline([\n", "    ('cats', ColumnTransformer([\n", "        ('one', OneHotEncoder(), ['sex', 'pclass']),\n", "        ('imp', SimpleImputer(), ['age'])\n", "    ]))\n", "])\n", "pipe.fit(X_train)\n", "pipe.transform(X_test)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We extend the pipeline with the random forest."]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[174, 28],\n", " [ 47, 79]], dtype=int64)"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe = Pipeline([\n", "    ('cats', ColumnTransformer([\n", "        ('one', OneHotEncoder(), ['sex', 'pclass']),\n", "        ('imp', SimpleImputer(), ['age'])\n", "    ])),\n", "    ('rf', RandomForestClassifier(n_estimators=100))\n", "])\n", "\n", "pipe.fit(X_train, y_train)\n", "confusion_matrix(y_test, pipe.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This is clearer. 
The feature importances give some insight into the variables, but their actual importance should be checked by removing them one at a time."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('x0_female', 0.2279931985187704),\n", " ('x0_male', 0.21516489235386238),\n", " ('x1_1', 0.04301082695552079),\n", " ('x1_2', 0.019360189479876846),\n", " ('x1_3', 0.06817992681764856),\n", " ('age', 0.426290965874321)]"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["# the ColumnTransformer emits the one-hot columns first, then the imputed age\n", "cols = list(names) + ['age']\n", "list(zip(cols, pipe.steps[-1][1].feature_importances_))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Using other variables\n", "\n", "The variable `ticket` contains additional information, but since each ticket is unique, it is hard to use as is. We have to look for redundancies from one row to the next, otherwise it is unusable. The redundancy comes from the words the column contains, so it has to be split into words:\n", "\n", "* We list all the unique words: these are the categories.\n", "* We create one variable per word: 1 if the expression contains the word, 0 otherwise.\n", "\n", "This approach is called *bag of words*."]}, {"cell_type": "code", "execution_count": 21, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/plain": ["['SOTON/O.Q. 3101310',\n", " 'A/4 48873',\n", " 'STON/O2. 3101283',\n", " 'PC 17611',\n", " 'CA 31352',\n", " 'C 17368',\n", " 'A/5 3594',\n", " 'W./C. 6608',\n", " 'C.A. 34050',\n", " 'C.A. 
24580']"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["[_ for _ in set(df.ticket) if ' ' in _][:10]"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": ["X = df[[\"age\", \"sex\", \"pclass\", \"ticket\"]]\n", "y = df['survived']\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "code", "execution_count": 23, "metadata": {"scrolled": false}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 981 and the array at index 2 has size 1\n"]}], "source": ["from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "pipe = Pipeline([\n", "    ('cats', ColumnTransformer([\n", "        ('one', OneHotEncoder(), ['sex', 'pclass']),\n", "        ('imp', SimpleImputer(), ['age']),\n", "        ('bow', CountVectorizer(), ['ticket']),\n", "    ])),\n", "])\n", "\n", "try:\n", "    pipe.fit(X_train)\n", "except ValueError as e:\n", "    print(e)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) model is somewhat problematic because it only handles a single column: it expects a one-dimensional sequence of strings, whereas the ColumnTransformer passes it a two-dimensional table when the columns are given as a list. 
A workaround is needed to use it in a pipeline."]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"data": {"text/plain": ["<981x756 sparse matrix of type '<class 'numpy.int64'>'\n", "\twith 1170 stored elements in Compressed Sparse Row format>"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["CountVectorizer().fit_transform(X_train['ticket'])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We try a little magic trick with the class *TextVectorizerTransformer*."]}, {"cell_type": "code", "execution_count": 25, "metadata": {"scrolled": false}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["C:\\xavierdupre\\__home_\\github_fork\\scikit-learn\\sklearn\\utils\\deprecation.py:144: FutureWarning: The sklearn.cluster.k_means_ module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.cluster. Anything that cannot be imported from sklearn.cluster is now part of the private API.\n", " warnings.warn(message, FutureWarning)\n"]}, {"data": {"text/plain": ["Pipeline(memory=None,\n", " steps=[('cats',\n", " ColumnTransformer(n_jobs=None, remainder='drop',\n", " sparse_threshold=0.3,\n", " transformer_weights=None,\n", " transformers=[('one',\n", " OneHotEncoder(categories='auto',\n", " drop=None,\n", " dtype=<class 'numpy.float64'>,\n", " handle_unknown='error',\n", " sparse=True),\n", " ['sex', 'pclass']),\n", " ('imp',\n", " SimpleImputer(add_indicator=False,\n", " copy=True,\n", " fill_value=None,\n", " missing_...\n", " TextVectorizerTransformer(estimator=CountVectorizer(analyzer='word',\n", " binary=False,\n", " decode_error='strict',\n", " dtype=<class 'numpy.int64'>,\n", " encoding='utf-8',\n", " input='content',\n", " lowercase=True,\n", " max_df=1.0,\n", " max_features=None,\n", " min_df=1,\n", " ngram_range=(1,\n", " 1),\n", " preprocessor=None,\n", " stop_words=None,\n", " strip_accents=None,\n", " 
token_pattern='(?u)\\b\\w\\w+\\b',\n", " tokenizer=None,\n", " vocabulary=None)),\n", " ['ticket'])],\n", " verbose=False))],\n", " verbose=False)"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["from papierstat.mltricks import TextVectorizerTransformer\n", "\n", "pipe = Pipeline([\n", "    ('cats', ColumnTransformer([\n", "        ('one', OneHotEncoder(), ['sex', 'pclass']),\n", "        ('imp', SimpleImputer(), ['age']),\n", "        ('bow', TextVectorizerTransformer(CountVectorizer()), ['ticket']),\n", "    ])),\n", "])\n", "\n", "pipe.fit(X_train)"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"data": {"text/plain": ["<328x762 sparse matrix of type '<class 'numpy.float64'>'\n", "\twith 1174 stored elements in Compressed Sparse Row format>"]}, "execution_count": 27, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe.transform(X_test)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["A classifier can be added to the pipeline."]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[191, 14],\n", " [ 35, 88]], dtype=int64)"]}, "execution_count": 28, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe = Pipeline([\n", "    ('cats', ColumnTransformer([\n", "        ('one', OneHotEncoder(), ['sex', 'pclass']),\n", "        ('imp', SimpleImputer(), ['age']),\n", "        ('bow', TextVectorizerTransformer(CountVectorizer()), ['ticket']),\n", "    ])),\n", "    ('rf', RandomForestClassifier(n_estimators=100))\n", "])\n", "\n", "pipe.fit(X_train, y_train)\n", "confusion_matrix(y_test, pipe.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This model performs better: it misclassifies 49 of the 328 test passengers, against 75 for the pipeline without `ticket`. The two runs use different random train/test splits, so the comparison is only indicative."]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": 
".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2"}}, "nbformat": 4, "nbformat_minor": 4}