{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Machine learning avec des cat\u00e9gories et du texte\n", "\n", "Le jeu [Titanic](http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets) contient la liste des passages du Titanic, quelques informations comme le billet, la classe, et le fait qu'ils aient surv\u00e9cu. Il est possible d'utiliser le machine learning pour \u00e9tudier la probabilit\u00e9 de survie ou plut\u00f4t de comprendre un peu mieux qui a eu la chance de survivre. S'il est possible de pr\u00e9dire, alors il existe une sorte de r\u00e8gle."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Les donn\u00e9es\n", "\n", "On peut les r\u00e9cup\u00e9rer manuellement ou utiliser la fonction *load_titanic_dataset*. On pourra prendre l'un des deux jeux suivants."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>pclass</th>\n", "      <th>survived</th>\n", "      <th>name</th>\n", "      <th>sex</th>\n", "      <th>age</th>\n", "      <th>sibsp</th>\n", "      <th>parch</th>\n", "      <th>ticket</th>\n", "      <th>fare</th>\n", "      <th>cabin</th>\n", "      <th>embarked</th>\n", "      <th>boat</th>\n", "      <th>body</th>\n", "      <th>home.dest</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>1</td>\n", "      <td>1</td>\n", "      <td>Allen, Miss. Elisabeth Walton</td>\n", "      <td>female</td>\n", "      <td>29.00</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>24160</td>\n", "      <td>211.3375</td>\n", "      <td>B5</td>\n", "      <td>S</td>\n", "      <td>2</td>\n", "      <td>NaN</td>\n", "      <td>St Louis, MO</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>1</td>\n", "      <td>1</td>\n", "      <td>Allison, Master. Hudson Trevor</td>\n", "      <td>male</td>\n", "      <td>0.92</td>\n", "      <td>1</td>\n", "      <td>2</td>\n", "      <td>113781</td>\n", "      <td>151.5500</td>\n", "      <td>C22 C26</td>\n", "      <td>S</td>\n", "      <td>11</td>\n", "      <td>NaN</td>\n", "      <td>Montreal, PQ / Chesterville, ON</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["   pclass  survived                            name     sex    age  sibsp  \\\n", "0       1         1   Allen, Miss. Elisabeth Walton  female  29.00      0   \n", "1       1         1  Allison, Master. Hudson Trevor    male   0.92      1   \n", "\n", "   parch  ticket      fare    cabin embarked boat  body  \\\n", "0      0   24160  211.3375       B5        S    2   NaN   \n", "1      2  113781  151.5500  C22 C26        S   11   NaN   \n", "\n", "                         home.dest  \n", "0                     St Louis, MO  \n", "1  Montreal, PQ / Chesterville, ON  "]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from papierstat.datasets import load_titanic_dataset\n", "data1 = load_titanic_dataset(subset=\"A\")\n", "data1.head(n=2)"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>row.names</th>\n", "      <th>pclass</th>\n", "      <th>survived</th>\n", "      <th>name</th>\n", "      <th>age</th>\n", "      <th>embarked</th>\n", "      <th>home.dest</th>\n", "      <th>room</th>\n", "      <th>ticket</th>\n", "      <th>boat</th>\n", "      <th>sex</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>1</td>\n", "      <td>1st</td>\n", "      <td>1</td>\n", "      <td>Allen, Miss Elisabeth Walton</td>\n", "      <td>29.0</td>\n", "      <td>Southampton</td>\n", "      <td>St Louis, MO</td>\n", "      <td>B-5</td>\n", "      <td>24160 L221</td>\n", "      <td>2</td>\n", "      <td>female</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>2</td>\n", "      <td>1st</td>\n", "      <td>0</td>\n", "      <td>Allison, Miss Helen Loraine</td>\n", "      <td>2.0</td>\n", "      <td>Southampton</td>\n", "      <td>Montreal, PQ / Chesterville, ON</td>\n", "      <td>C26</td>\n", "      <td>NaN</td>\n", "      <td>NaN</td>\n", "      <td>female</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["   row.names pclass  survived                          name   age  \\\n", "0          1    1st         1  Allen, Miss Elisabeth Walton  29.0   \n", "1          2    1st         0   Allison, Miss Helen Loraine   2.0   \n", "\n", "      embarked                        home.dest room      ticket boat     sex  \n", "0  Southampton                     St Louis, MO  B-5  24160 L221    2  female  \n", "1  Southampton  Montreal, PQ / Chesterville, ON  C26         NaN  NaN  female  "]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["data2 = load_titanic_dataset(subset=\"B\")\n", "data2.head(n=2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Premier mod\u00e8le de pr\u00e9diction\n", "\n", "On veut savoir si la survie de chaque personne \u00e9tait plut\u00f4t al\u00e9atoire o\u00f9 certaines personnes ont \u00e9t\u00e9 privil\u00e9gi\u00e9es. Plut\u00f4t que de se lancer dans une \u00e9tude de statistique descriptive, on cale un mod\u00e8le de pr\u00e9diction. Si celui-ci fonctionne, cela signifie qu'il existe un lien entre la survie et certaines des informations connues sur chaque passager. On consid\u00e8re les variables ``age, sex, pclass``."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": ["df = data1"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "\n", "X = df[[\"age\", \"sex\", \"pclass\"]]\n", "y = df['survived']\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["({'female', 'male'}, {1, 2, 3})"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["set(df.sex), set(df.pclass)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il faut d'abord transformer les variables cat\u00e9gorielles en variables num\u00e9riques. C'est le r\u00f4le d'un [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) : chaque cat\u00e9gorie devient une variable binaire."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,\n", "              handle_unknown='error', sparse=True)"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.preprocessing import OneHotEncoder\n", "one = OneHotEncoder()\n", "one.fit(X_train[['sex', 'pclass']])"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["<981x5 sparse matrix of type '<class 'numpy.float64'>'\n", "\twith 1962 stored elements in Compressed Sparse Row format>"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["one.transform(X_train[['sex', 'pclass']])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Le r\u00e9sultat est une matrice sparse ou creuse qui ne contient que les valeurs non nulles. Pour voir la matrice, il faut la rendre dense."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[1., 0., 0., 1., 0.],\n", "        [0., 1., 0., 0., 1.],\n", "        [0., 1., 1., 0., 0.],\n", "        ...,\n", "        [1., 0., 0., 0., 1.],\n", "        [0., 1., 0., 0., 1.],\n", "        [0., 1., 0., 0., 1.]])"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["one.transform(X_train[['sex', 'pclass']]).todense()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Cinq colonnes correspondant aux cinq cat\u00e9gories dont les noms sont les suivants :"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["array(['x0_female', 'x0_male', 'x1_1', 'x1_2', 'x1_3'], dtype=object)"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["names = one.get_feature_names()\n", "names"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[31.,  1.,  0.,  0.,  1.,  0.],\n", "        [28.,  0.,  1.,  0.,  0.,  1.],\n", "        [21.,  0.,  1.,  1.,  0.,  0.],\n", "        [28.,  0.,  1.,  0.,  0.,  1.],\n", "        [nan,  0.,  1.,  0.,  1.,  0.],\n", "        [80.,  0.,  1.,  1.,  0.,  0.],\n", "        [39.,  0.,  1.,  1.,  0.,  0.],\n", "        [12.,  1.,  0.,  0.,  1.,  0.],\n", "        [nan,  0.,  1.,  1.,  0.,  0.],\n", "        [31.,  1.,  0.,  0.,  1.,  0.]])"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "cats = one.transform(X_train[['sex', 'pclass']]).todense()\n", "age = X_train[['age']]\n", "feat = numpy.hstack([age, cats])\n", "feat[:10]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La colonne `age` contient des valeurs manquantes. On les remplace par la moyenne."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["((1309, 1), (1046, 1))"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["df[['age']].shape, df[['age']].dropna().shape"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[31.        ,  1.        ,  0.        ,  0.        ,  1.        ,\n", "          0.        ],\n", "        [28.        ,  0.        ,  1.        ,  0.        ,  0.        ,\n", "          1.        ],\n", "        [21.        ,  0.        ,  1.        ,  1.        ,  0.        ,\n", "          0.        ],\n", "        [28.        ,  0.        ,  1.        ,  0.        ,  0.        ,\n", "          1.        ],\n", "        [29.60563707,  0.        ,  1.        ,  0.        ,  1.        ,\n", "          0.        ],\n", "        [80.        ,  0.        ,  1.        ,  1.        ,  0.        ,\n", "          0.        ],\n", "        [39.        ,  0.        ,  1.        ,  1.        ,  0.        ,\n", "          0.        ],\n", "        [12.        ,  1.        ,  0.        ,  0.        ,  1.        ,\n", "          0.        ],\n", "        [29.60563707,  0.        ,  1.        ,  1.        ,  0.        ,\n", "          0.        ],\n", "        [31.        ,  1.        ,  0.        ,  0.        ,  1.        ,\n", "          0.        ]])"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.impute import SimpleImputer\n", "imp = SimpleImputer()\n", "new_age = imp.fit_transform(X_train[['age']])\n", "feat = numpy.hstack([new_age, cats])\n", "feat[:10]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut maintenant caler une for\u00eat al\u00e9atoire."]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", "                       criterion='gini', max_depth=None, max_features='auto',\n", "                       max_leaf_nodes=None, max_samples=None,\n", "                       min_impurity_decrease=0.0, min_impurity_split=None,\n", "                       min_samples_leaf=1, min_samples_split=2,\n", "                       min_weight_fraction_leaf=0.0, n_estimators=100,\n", "                       n_jobs=None, oob_score=False, random_state=None,\n", "                       verbose=0, warm_start=False)"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.ensemble import RandomForestClassifier\n", "rf = RandomForestClassifier(n_estimators=100)\n", "rf.fit(feat, y_train)"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": ["cats_test = one.transform(X_test[['sex', 'pclass']])\n", "age_test = imp.transform(X_test[['age']])\n", "feat_test = numpy.hstack([age_test, cats_test.todense()])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On regarde les pr\u00e9dictions sur la base de test et \u00e7a a l'air de marcher."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[177,  25],\n", "       [ 49,  77]], dtype=int64)"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import confusion_matrix\n", "\n", "pred = rf.predict(feat_test)\n", "confusion_matrix(y_test, pred)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La survie n'est pas donc al\u00e9atoire. On met tout cela dans un pipeline car c'est toujours plus simple \u00e0 lire."]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 0.        ,  1.        ,  0.        ,  1.        ,  0.        ,\n", "        30.        ],\n", "       [ 0.        ,  1.        ,  0.        ,  0.        ,  1.        ,\n", "        29.60563707],\n", "       [ 0.        ,  1.        ,  0.        ,  0.        ,  1.        ,\n", "        29.60563707],\n", "       ...,\n", "       [ 0.        ,  1.        ,  0.        ,  0.        ,  1.        ,\n", "        25.        ],\n", "       [ 1.        ,  0.        ,  1.        ,  0.        ,  0.        ,\n", "        53.        ],\n", "       [ 1.        ,  0.        ,  1.        ,  0.        ,  0.        ,\n", "        17.        ]])"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.pipeline import Pipeline\n", "from sklearn.compose import ColumnTransformer\n", "\n", "pipe = Pipeline([\n", "    ('cats', ColumnTransformer([\n", "        ('one', OneHotEncoder(), ['sex', 'pclass']),\n", "        ('imp', SimpleImputer(), ['age'])\n", "    ]))\n", "])\n", "pipe.fit(X_train)\n", "pipe.transform(X_test)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On modifie le pipeline pour ajouter la for\u00eat al\u00e9atoire."]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[174,  28],\n", "       [ 47,  79]], dtype=int64)"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe = Pipeline([\n", "    ('cats', ColumnTransformer([\n", "        ('one', OneHotEncoder(), ['sex', 'pclass']),\n", "        ('imp', SimpleImputer(), ['age'])\n", "    ])),\n", "    ('rf', RandomForestClassifier(n_estimators=100))\n", "])\n", "\n", "pipe.fit(X_train, y_train)\n", "confusion_matrix(y_test, pipe.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["C'est plus clair. Quelques explications sur les variables, mais il faut v\u00e9rifier en les retirant une \u00e0 une pour v\u00e9rifier leur importance dans l'histoire."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('age', 0.2279931985187704),\n", " ('x0_female', 0.21516489235386238),\n", " ('x0_male', 0.04301082695552079),\n", " ('x1_1', 0.019360189479876846),\n", " ('x1_2', 0.06817992681764856),\n", " ('x1_3', 0.426290965874321)]"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["cols = ['age'] + list(names)\n", "list(zip(cols, pipe.steps[-1][1].feature_importances_))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Utiliser d'autres variables\n", "\n", "La variable `ticket` contient d'autres informations mais comme chaque ticket est unique, il est difficile de l'utiliser telle quelle. Il faut chercher des redondances d'une ligne \u00e0 l'autre autrement c'est inexploitable. Et la redondance vient des mots que cette colonne contient. Il faut d\u00e9couper en mots :\n", "\n", "* On fait l'inventaire de tous les mots uniques : ce sont des cat\u00e9gories.\n", "* On cr\u00e9e une variable par mot : 1 si l'expression contient le mot, 0 sinon.\n", "\n", "C'est une approche dite *bag of words* ou *sac de mots*."]}, {"cell_type": "code", "execution_count": 21, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/plain": ["['SOTON/O.Q. 3101310',\n", " 'A/4 48873',\n", " 'STON/O2. 3101283',\n", " 'PC 17611',\n", " 'CA 31352',\n", " 'C 17368',\n", " 'A/5 3594',\n", " 'W./C. 6608',\n", " 'C.A. 34050',\n", " 'C.A. 24580']"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["[_ for _ in set(df.ticket) if ' ' in _][:10]"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": ["X = df[[\"age\", \"sex\", \"pclass\", \"ticket\"]]\n", "y = df['survived']\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "code", "execution_count": 23, "metadata": {"scrolled": false}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 981 and the array at index 2 has size 1\n"]}], "source": ["from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "pipe = Pipeline([\n", "    ('cats', ColumnTransformer([\n", "        ('one', OneHotEncoder(), ['sex', 'pclass']),\n", "        ('imp', SimpleImputer(), ['age']),\n", "        ('bow', CountVectorizer(), ['ticket']),\n", "    ])),\n", "])\n", "\n", "try:\n", "    pipe.fit(X_train)\n", "except ValueError as e:\n", "    print(e)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Le mod\u00e8le [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) est un peu probl\u00e8matique car il ne traite qu'une seule colonne. Il faut ruser pour l'utiliser dans un pipeline."]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"data": {"text/plain": ["<981x756 sparse matrix of type '<class 'numpy.int64'>'\n", "\twith 1170 stored elements in Compressed Sparse Row format>"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["CountVectorizer().fit_transform(X_train['ticket'])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On essaye un petit tour de magie avec la classe *TextVectorizerTransformer*."]}, {"cell_type": "code", "execution_count": 25, "metadata": {"scrolled": false}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["C:\\xavierdupre\\__home_\\github_fork\\scikit-learn\\sklearn\\utils\\deprecation.py:144: FutureWarning: The sklearn.cluster.k_means_ module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.cluster. Anything that cannot be imported from sklearn.cluster is now part of the private API.\n", "  warnings.warn(message, FutureWarning)\n"]}, {"data": {"text/plain": ["Pipeline(memory=None,\n", "         steps=[('cats',\n", "                 ColumnTransformer(n_jobs=None, remainder='drop',\n", "                                   sparse_threshold=0.3,\n", "                                   transformer_weights=None,\n", "                                   transformers=[('one',\n", "                                                  OneHotEncoder(categories='auto',\n", "                                                                drop=None,\n", "                                                                dtype=<class 'numpy.float64'>,\n", "                                                                handle_unknown='error',\n", "                                                                sparse=True),\n", "                                                  ['sex', 'pclass']),\n", "                                                 ('imp',\n", "                                                  SimpleImputer(add_indicator=False,\n", "                                                                copy=True,\n", "                                                                fill_value=None,\n", "                                                                missing_...\n", "                                                  TextVectorizerTransformer(estimator=CountVectorizer(analyzer='word',\n", "                                                                                                      binary=False,\n", "                                                                                                      decode_error='strict',\n", "                                                                                                      dtype=<class 'numpy.int64'>,\n", "                                                                                                      encoding='utf-8',\n", "                                                                                                      input='content',\n", "                                                                                                      lowercase=True,\n", "                                                                                                      max_df=1.0,\n", "                                                                                                      max_features=None,\n", "                                                                                                      min_df=1,\n", "                                                                                                      ngram_range=(1,\n", "                                                                                                                   1),\n", "                                                                                                      preprocessor=None,\n", "                                                                                                      stop_words=None,\n", "                                                                                                      strip_accents=None,\n", "                                                                                                      token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", "                                                                                                      tokenizer=None,\n", "                                                                                                      vocabulary=None)),\n", "                                                  ['ticket'])],\n", "                                   verbose=False))],\n", "         verbose=False)"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["from papierstat.mltricks import TextVectorizerTransformer\n", "\n", "pipe = Pipeline([\n", "    ('cats', ColumnTransformer([\n", "        ('one', OneHotEncoder(), ['sex', 'pclass']),\n", "        ('imp', SimpleImputer(), ['age']),\n", "        ('bow', TextVectorizerTransformer(CountVectorizer()), ['ticket']),\n", "    ])),\n", "])\n", "\n", "pipe.fit(X_train)"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"data": {"text/plain": ["<328x762 sparse matrix of type '<class 'numpy.float64'>'\n", "\twith 1174 stored elements in Compressed Sparse Row format>"]}, "execution_count": 27, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe.transform(X_test)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut ajouter un classifier au pipeline."]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[191,  14],\n", "       [ 35,  88]], dtype=int64)"]}, "execution_count": 28, "metadata": {}, "output_type": "execute_result"}], "source": ["pipe = Pipeline([\n", "    ('cats', ColumnTransformer([\n", "        ('one', OneHotEncoder(), ['sex', 'pclass']),\n", "        ('imp', SimpleImputer(), ['age']),\n", "        ('bow', TextVectorizerTransformer(CountVectorizer()), ['ticket']),\n", "    ])),\n", "    ('rf', RandomForestClassifier(n_estimators=100))\n", "])\n", "\n", "pipe.fit(X_train, y_train)\n", "confusion_matrix(y_test, pipe.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["C'est un meilleur mod\u00e8le."]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2"}}, "nbformat": 4, "nbformat_minor": 4}