{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - Text and machine learning\n", "\n", "A review of statistical [word embedding](https://en.wikipedia.org/wiki/Word_embedding) methods (~ [NLP](https://en.wikipedia.org/wiki/Natural_language_processing)), or how to turn textual information into vectors in a vector space (*features*). Exercises are included along the way."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Data\n", "\n", "We will work on Twitter data collected with the keyword macron: [tweets_macron_sijetaispresident_201609.zip](https://github.com/sdpython/ensae_teaching_cs/raw/master/src/ensae_teaching_cs/data/data_web/tweets_macron_sijetaispresident_201609.zip)."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " \n", " 0 \n", " 1 \n", " \n", " \n", " \n", " \n", " index \n", " 776066992054861825 \n", " 776067660979245056 \n", " \n", " \n", " nb_user_mentions \n", " 0 \n", " 0 \n", " \n", " \n", " nb_extended_entities \n", " 0 \n", " 0 \n", " \n", " \n", " nb_hashtags \n", " 1 \n", " 1 \n", " \n", " \n", " geo \n", " NaN \n", " NaN \n", " \n", " \n", " text_hashtags \n", " , SiJ\u00e9taisPr\u00e9sident \n", " , SiJ\u00e9taisPr\u00e9sident \n", " \n", " \n", " annee \n", " 2016.0 \n", " 2016.0 \n", " \n", " \n", " delimit_mention \n", " NaN \n", " NaN \n", " \n", " \n", " lang \n", " fr \n", " fr \n", " \n", " \n", " id_str \n", " 776066992054861824.0 \n", " 776067660979245056.0 \n", " \n", " \n", " text_mention \n", " NaN \n", " NaN \n", " \n", " \n", " retweet_count \n", " 4.0 \n", " 5.0 \n", " \n", " \n", " favorite_count \n", " 3.0 \n", " 8.0 \n", " \n", " \n", " type_extended_entities \n", " [] \n", " [] \n", " \n", " \n", " text \n", " #SiJ\u00e9taisPr\u00e9sident se serait la fin du monde..... \n", " #SiJ\u00e9taisPr\u00e9sident je donnerai plus de vacance... \n", " \n", " \n", " nb_user_photos \n", " 0.0 \n", " 0.0 \n", " \n", " \n", " nb_urls \n", " 0.0 \n", " 0.0 \n", " \n", " \n", " nb_symbols \n", " 0.0 \n", " 0.0 \n", " \n", " \n", " created_at \n", " Wed Sep 14 14:36:04 +0000 2016 \n", " Wed Sep 14 14:38:43 +0000 2016 \n", " \n", " \n", " delimit_hash \n", " , 0, 18 \n", " , 0, 18 \n", " \n", " \n", "
\n", "
"], "text/plain": [" 0 \\\n", "index 776066992054861825 \n", "nb_user_mentions 0 \n", "nb_extended_entities 0 \n", "nb_hashtags 1 \n", "geo NaN \n", "text_hashtags , SiJ\u00e9taisPr\u00e9sident \n", "annee 2016.0 \n", "delimit_mention NaN \n", "lang fr \n", "id_str 776066992054861824.0 \n", "text_mention NaN \n", "retweet_count 4.0 \n", "favorite_count 3.0 \n", "type_extended_entities [] \n", "text #SiJ\u00e9taisPr\u00e9sident se serait la fin du monde..... \n", "nb_user_photos 0.0 \n", "nb_urls 0.0 \n", "nb_symbols 0.0 \n", "created_at Wed Sep 14 14:36:04 +0000 2016 \n", "delimit_hash , 0, 18 \n", "\n", " 1 \n", "index 776067660979245056 \n", "nb_user_mentions 0 \n", "nb_extended_entities 0 \n", "nb_hashtags 1 \n", "geo NaN \n", "text_hashtags , SiJ\u00e9taisPr\u00e9sident \n", "annee 2016.0 \n", "delimit_mention NaN \n", "lang fr \n", "id_str 776067660979245056.0 \n", "text_mention NaN \n", "retweet_count 5.0 \n", "favorite_count 8.0 \n", "type_extended_entities [] \n", "text #SiJ\u00e9taisPr\u00e9sident je donnerai plus de vacance... \n", "nb_user_photos 0.0 \n", "nb_urls 0.0 \n", "nb_symbols 0.0 \n", "created_at Wed Sep 14 14:38:43 +0000 2016 \n", "delimit_hash , 0, 18 "]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from ensae_teaching_cs.data import twitter_zip\n", "df = twitter_zip(as_df=True)\n", "df.head(n=2).T"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["(5088, 20)"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["df.shape"]}, {"cell_type": "markdown", "metadata": {}, "source": ["5000 tweets n'est pas assez pour tirer des conclusions mais cela donne une id\u00e9e. 
We drop the rows with missing values."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["(5087, 2)"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["data = df[[\"retweet_count\", \"text\"]].dropna()\n", "data.shape"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Building a weighting\n", "\n", "Text is always tricky to process. It is not always easy to go beyond binary information: is a word present or not? Words have no numerical meaning. A list of tweets carries little meaning on its own, apart from sorting it by another column, the number of retweets for example."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " \n", " retweet_count \n", " text \n", " \n", " \n", " \n", " \n", " 2038 \n", " 842.0 \n", " #SiJetaisPresident travailler moins pour gagne... \n", " \n", " \n", " 2453 \n", " 816.0 \n", " #SiJetaisPresident je ferais revenir l'\u00e9t\u00e9 ave... \n", " \n", " \n", " 2627 \n", " 529.0 \n", " #SiJetaisPresident le mcdo livrerai \u00e0 domicile \n", " \n", " \n", " 1402 \n", " 289.0 \n", " #SiJetaisPresident les devoirs \u00e7a serait de re... \n", " \n", " \n", " 2198 \n", " 276.0 \n", " #SiJetaisPresident ? Pr\u00e9sident c'est pour les... \n", " \n", " \n", "
\n", "
"], "text/plain": [" retweet_count text\n", "2038 842.0 #SiJetaisPresident travailler moins pour gagne...\n", "2453 816.0 #SiJetaisPresident je ferais revenir l'\u00e9t\u00e9 ave...\n", "2627 529.0 #SiJetaisPresident le mcdo livrerai \u00e0 domicile\n", "1402 289.0 #SiJetaisPresident les devoirs \u00e7a serait de re...\n", "2198 276.0 #SiJetaisPresident ? Pr\u00e9sident c'est pour les..."]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["data.sort_values(\"retweet_count\", ascending=False).head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Sans cette colonne qui mesure la popularit\u00e9, il faut trouver un moyen d'extraire de l'information. On d\u00e9coupe alors en mots et on constuire un mod\u00e8le de langage : les [n-grammes](https://fr.wikipedia.org/wiki/N-gramme). Si un tweet est constitu\u00e9 de la s\u00e9quence de mots $(m_1, m_2, ..., m_k)$. On d\u00e9finit sa probabilit\u00e9 comme :\n", "\n", "$$P(tweet) = P(w_1, w_2) P(w_3 | w_2, w_1) P(w_4 | w_3, w_2) ... P(w_k | w_{k-1}, w_{k-2})$$\n", "\n", "Dans ce cas, $n=3$ car on suppose que la probabilit\u00e9 d'apparition d'un mot ne d\u00e9pend que des deux pr\u00e9c\u00e9dents. On estime chaque n-grammes comme suit :\n", "\n", "$$P(c | a, b) = \\frac{ \\# (a, b, c)}{ \\# (a, b)}$$\n", "\n", "C'est le nombre de fois o\u00f9 on observe la s\u00e9quence $(a,b,c)$ divis\u00e9 par le nombre de fois o\u00f9 on observe la s\u00e9quence $(a,b)$."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Tokenisation\n", "\n", "D\u00e9couper en mots para\u00eet simple ``tweet.split()`` et puis il y a toujours des surprises avec le texte, la prise en compte des tirets, les majuscules, les espaces en trop. 
We use a dedicated *tokenizer*: [TweetTokenizer](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.casual.TweetTokenizer) or a tokenizer that takes the language into account."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["['#sij\u00e9taispr\u00e9sident',\n", " 'se',\n", " 'serait',\n", " 'la',\n", " 'fin',\n", " 'du',\n", " 'monde',\n", " '...',\n", " 'mdr',\n", " '\ud83d\ude02']"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["from nltk.tokenize import TweetTokenizer\n", "tknzr = TweetTokenizer(preserve_case=False)\n", "tokens = tknzr.tokenize(data.loc[0, \"text\"])\n", "tokens"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### n-grams\n", "\n", "* [N-Gram-Based Text Categorization: Categorizing Text With Python](http://blog.alejandronolla.com/2013/05/20/n-gram-based-text-categorization-categorizing-text-with-python/)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["[(None, None, None, '#sij\u00e9taispr\u00e9sident'),\n", " (None, None, '#sij\u00e9taispr\u00e9sident', 'se'),\n", " (None, '#sij\u00e9taispr\u00e9sident', 'se', 'serait'),\n", " ('#sij\u00e9taispr\u00e9sident', 'se', 'serait', 'la'),\n", " ('se', 'serait', 'la', 'fin'),\n", " ('serait', 'la', 'fin', 'du'),\n", " ('la', 'fin', 'du', 'monde'),\n", " ('fin', 'du', 'monde', '...'),\n", " ('du', 'monde', '...', 'mdr'),\n", " ('monde', '...', 'mdr', '\ud83d\ude02'),\n", " ('...', 'mdr', '\ud83d\ude02', None),\n", " ('mdr', '\ud83d\ude02', None, None),\n", " ('\ud83d\ude02', None, None, None)]"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["from nltk.util import ngrams\n", "# n=4 here; the sequence is padded with None at both ends\n", "generated_ngrams = ngrams(tokens, 4, pad_left=True, pad_right=True)\n", "list(generated_ngrams)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 1: compute n-grams on the tweets"]}, {"cell_type": 
"code", "execution_count": 8, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Nettoyage\n", "\n", "Tous les mod\u00e8les sont plus stables sans les stop-words, c'est-\u00e0-dire tous les mots pr\u00e9sents dans n'importe quel documents et qui n'apporte pas de sens (\u00e0, de, le, la, ...). Souvent, on enl\u00e8ve les accents, la ponctuation... Moins de variabilit\u00e9 signifie des statistiques plus fiable."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercice 2 : nettoyer les tweets\n", "\n", "Voir [stem](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Structure de graphe\n", "\n", "On cherche cette fois-ci \u00e0 construire des coordonn\u00e9es pour chaque tweet."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### matrice d'adjacence"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Une option courante est de d\u00e9couper chaque expression en mots puis de cr\u00e9er une matrice *expression x mot* ou chaque case indique la pr\u00e9sence d'un mot dans une expression."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["(5087, 11924)"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.feature_extraction.text import CountVectorizer\n", "count_vect = CountVectorizer()\n", "counts = count_vect.fit_transform(data[\"text\"])\n", "counts.shape"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On aboutit \u00e0 une matrice sparse ou chaque expression est repr\u00e9sent\u00e9e \u00e0 une vecteur ou chaque 1 repr\u00e9sente l'appartenance d'un mot \u00e0 l'ensemble."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["scipy.sparse.csr.csr_matrix"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["type(counts)"]}, 
{"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0, 0, 0, 0, 0],\n", " [0, 0, 0, 0, 0],\n", " [0, 0, 0, 0, 0],\n", " [0, 0, 0, 0, 0],\n", " [0, 0, 0, 0, 0]], dtype=int64)"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["counts[:5,:5].toarray()"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["'#SiJ\u00e9taisPr\u00e9sident se serait la fin du monde... mdr \ud83d\ude02'"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["data.loc[0,\"text\"]"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["8"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["counts[0,:].sum()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### tf-idf"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This kind of technique produces very high-dimensional matrices that need to be reduced. One can drop rare words or very frequent words. [tf-idf](https://fr.wikipedia.org/wiki/TF-IDF) is a technique that comes from search engines. 
It builds the same kind of matrix (same dimensions) but assigns to each (document, word) pair a weight that depends on how often the word occurs in the document and on the number of documents containing that word.\n", "\n", "$$idf(t) = \\log \\frac{\\# D}{\\#\\{d \\; | \\; t \\in d \\}}$$\n", "\n", "Where:\n", "\n", "* $\\#D$ is the number of tweets\n", "* $\\#\\{d \\; | \\; t \\in d \\}$ is the number of tweets containing the word $t$"]}, {"cell_type": "markdown", "metadata": {}, "source": ["$f(t,d)$ is the number of occurrences of a word $t$ in a document $d$.\n", "\n", "$$tf(t,d) = \\frac{1}{2} + \\frac{1}{2} \\frac{f(t,d)}{\\max_{t' \\in d} f(t',d)}$$"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We then define $tfidf(t,d)$:\n", "\n", "$$tfidf(t,d) = tf(t,d) idf(t)$$\n", "\n", "The $idf(t)$ term favors words present in few documents, while the $tf(t,d)$ term favors words repeated many times within the same document. We apply it to the previous matrix. Note that scikit-learn's `TfidfTransformer` uses a slightly different variant by default (raw counts for $tf$, a smoothed $idf$, and L2 normalization of each row), so its values differ from these formulas."]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["(5087, 11924)"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.feature_extraction.text import TfidfTransformer\n", "tfidf = TfidfTransformer()\n", "res = tfidf.fit_transform(counts)\n", "res.shape"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["2.6988143126521047"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["res[0,:].sum()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 3: tf-idf without keywords\n", "\n", "The resulting matrix is high-dimensional. 
Find a way to reduce it with [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### word2vec\n", "\n", "* [word2vec From theory to practice](http://hen-drik.de/pub/Heuer%20-%20word2vec%20-%20From%20theory%20to%20practice.pdf)\n", "* [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)\n", "* [word2vec](https://radimrehurek.com/gensim/models/word2vec.html)\n", "\n", "This algorithm starts from a representation of words as vectors in a space of dimension N = the number of distinct words. A word is represented by $(0,0, ..., 0, 1, 0, ..., 0)$. The trick is to reduce the number of dimensions by compressing this representation, with a PCA or with a nonlinear neural network."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["['#sij\u00e9taispr\u00e9sident',\n", " 'se',\n", " 'serait',\n", " 'la',\n", " 'fin',\n", " 'du',\n", " 'monde',\n", " '...',\n", " 'mdr',\n", " '\ud83d\ude02']"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["sentences = [tknzr.tokenize(_) for _ in data[\"text\"]]\n", "sentences[0]"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["2022-02-12 18:46:39,284 : INFO : collecting all words and their counts\n", "2022-02-12 18:46:39,284 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types\n", "2022-02-12 18:46:39,331 : INFO : collected 13279 word types from a corpus of 76421 raw words and 5087 sentences\n", "2022-02-12 18:46:39,332 : INFO : Creating a fresh vocabulary\n", "2022-02-12 18:46:39,400 : INFO : Word2Vec lifecycle event {'msg': 
'effective_min_count=1 retains 13279 unique words (100.0%% of original 13279, drops 0)', 'datetime': '2022-02-12T18:46:39.388519', 'gensim': '4.1.2', 'python': '3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19043-SP0', 'event': 'prepare_vocab'}\n", "2022-02-12 18:46:39,402 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=1 leaves 76421 word corpus (100.0%% of original 76421, drops 0)', 'datetime': '2022-02-12T18:46:39.401509', 'gensim': '4.1.2', 'python': '3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19043-SP0', 'event': 'prepare_vocab'}\n", "2022-02-12 18:46:39,498 : INFO : deleting the raw counts dictionary of 13279 items\n", "2022-02-12 18:46:39,498 : INFO : sample=0.001 downsamples 46 most-common words\n", "2022-02-12 18:46:39,498 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 56028.0861159631 word corpus (73.3%% of prior 76421)', 'datetime': '2022-02-12T18:46:39.498380', 'gensim': '4.1.2', 'python': '3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19043-SP0', 'event': 'prepare_vocab'}\n", "2022-02-12 18:46:39,663 : INFO : estimated required memory for 13279 words and 100 dimensions: 17262700 bytes\n", "2022-02-12 18:46:39,663 : INFO : resetting layer weights\n", "2022-02-12 18:46:39,679 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2022-02-12T18:46:39.678678', 'gensim': '4.1.2', 'python': '3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19043-SP0', 'event': 'build_vocab'}\n", "2022-02-12 18:46:39,680 : INFO : Word2Vec lifecycle event {'msg': 'training model with 3 workers on 13279 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2022-02-12T18:46:39.680669', 'gensim': 
'4.1.2', 'python': '3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19043-SP0', 'event': 'train'}\n", "2022-02-12 18:46:39,747 : INFO : worker thread finished; awaiting finish of 2 more threads\n", "2022-02-12 18:46:39,755 : INFO : worker thread finished; awaiting finish of 1 more threads\n", "2022-02-12 18:46:39,755 : INFO : worker thread finished; awaiting finish of 0 more threads\n", "2022-02-12 18:46:39,755 : INFO : EPOCH - 1 : training on 76421 raw words (56059 effective words) took 0.1s, 847131 effective words/s\n", "2022-02-12 18:46:39,813 : INFO : worker thread finished; awaiting finish of 2 more threads\n", "2022-02-12 18:46:39,819 : INFO : worker thread finished; awaiting finish of 1 more threads\n", "2022-02-12 18:46:39,823 : INFO : worker thread finished; awaiting finish of 0 more threads\n", "2022-02-12 18:46:39,824 : INFO : EPOCH - 2 : training on 76421 raw words (56030 effective words) took 0.1s, 935688 effective words/s\n", "2022-02-12 18:46:39,881 : INFO : worker thread finished; awaiting finish of 2 more threads\n", "2022-02-12 18:46:39,890 : INFO : worker thread finished; awaiting finish of 1 more threads\n", "2022-02-12 18:46:39,890 : INFO : worker thread finished; awaiting finish of 0 more threads\n", "2022-02-12 18:46:39,890 : INFO : EPOCH - 3 : training on 76421 raw words (55944 effective words) took 0.1s, 905191 effective words/s\n", "2022-02-12 18:46:39,952 : INFO : worker thread finished; awaiting finish of 2 more threads\n", "2022-02-12 18:46:39,963 : INFO : worker thread finished; awaiting finish of 1 more threads\n", "2022-02-12 18:46:39,971 : INFO : worker thread finished; awaiting finish of 0 more threads\n", "2022-02-12 18:46:39,972 : INFO : EPOCH - 4 : training on 76421 raw words (56072 effective words) took 0.1s, 774904 effective words/s\n", "2022-02-12 18:46:40,033 : INFO : worker thread finished; awaiting finish of 2 more threads\n", "2022-02-12 18:46:40,039 : INFO : 
worker thread finished; awaiting finish of 1 more threads\n", "2022-02-12 18:46:40,042 : INFO : worker thread finished; awaiting finish of 0 more threads\n", "2022-02-12 18:46:40,042 : INFO : EPOCH - 5 : training on 76421 raw words (56047 effective words) took 0.1s, 906799 effective words/s\n", "2022-02-12 18:46:40,043 : INFO : Word2Vec lifecycle event {'msg': 'training on 382105 raw words (280152 effective words) took 0.4s, 776815 effective words/s', 'datetime': '2022-02-12T18:46:40.043431', 'gensim': '4.1.2', 'python': '3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19043-SP0', 'event': 'train'}\n", "2022-02-12 18:46:40,044 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec(vocab=13279, vector_size=100, alpha=0.025)', 'datetime': '2022-02-12T18:46:40.044429', 'gensim': '4.1.2', 'python': '3.9.5 (tags/v3.9.5:0a7dcbd, May 3 2021, 17:27:52) [MSC v.1928 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19043-SP0', 'event': 'created'}\n"]}], "source": ["import gensim, logging\n", "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n", "\n", "model = gensim.models.Word2Vec(sentences, min_count=1)"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('mon', 0.9989398121833801),\n", " ('pays', 0.9989068508148193),\n", " ('ma', 0.9988953471183777),\n", " ('toutes', 0.9988815784454346),\n", " ('leur', 0.9987949132919312),\n", " ('tout', 0.9987940192222595),\n", " ('ses', 0.9987934231758118),\n", " ('mes', 0.998781144618988),\n", " ('france', 0.9987801909446716),\n", " ('au', 0.9987511038780212)]"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["model.wv.similar_by_word(\"fin\")"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["(100,)"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": 
["model.wv[\"fin\"].shape"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([-0.09920651, 0.15360324, 0.10844447, 0.12709534, 0.15020044,\n", " -0.21826063, 0.07867183, 0.2793031 , -0.1988279 , -0.135458 ,\n", " -0.08442771, -0.27579817, 0.05431064, 0.13231573, 0.06987454,\n", " -0.18821737, -0.0537038 , -0.10661628, -0.04758533, -0.3020647 ,\n", " 0.1704731 , 0.0394745 , 0.12408937, -0.05706318, -0.05796036,\n", " 0.03647643, -0.18711708, -0.10510068, -0.10040793, -0.08600791,\n", " 0.13921241, -0.0547129 , 0.09572571, -0.10740169, -0.00452373,\n", " 0.28817332, -0.01231772, 0.06307271, 0.02313815, -0.22305253,\n", " 0.12906754, -0.20111138, -0.12507376, 0.06637593, 0.06323538,\n", " -0.2289281 , -0.18086989, 0.05065202, 0.04751947, 0.0070283 ,\n", " 0.20169634, -0.15028226, 0.04512867, -0.08974832, -0.08562531,\n", " 0.23815149, 0.11708703, -0.08336464, -0.00898065, 0.00677549,\n", " -0.08762765, -0.06554074, 0.1182849 , 0.01473513, -0.11507029,\n", " 0.25605434, -0.05245751, 0.22131208, -0.27702177, 0.17844225,\n", " -0.28551322, 0.09160851, 0.19049928, 0.09809981, 0.18412267,\n", " -0.01433086, -0.06096153, -0.00965379, -0.04718976, 0.04390529,\n", " -0.2812708 , -0.00393267, -0.14382981, 0.09499372, -0.10859697,\n", " -0.07420573, 0.13133654, 0.06538489, 0.24226172, 0.03639907,\n", " 0.28915352, 0.05038366, 0.05872998, -0.0310102 , 0.30720538,\n", " 0.09244314, 0.20608151, 0.00660289, 0.07621165, 0.0461465 ],\n", " dtype=float32)"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["model.wv[\"fin\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Tagging\n", "\n", "The goal is to tag words, for example to determine whether a word is a verb, an adjective, ..."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### grammar\n", "\n", "See [nltk.grammar](http://www.nltk.org/api/nltk.html#module-nltk.grammar)."]}, {"cell_type": "markdown", 
"metadata": {}, "source": ["### CRF\n", "\n", "Voir [CRF](http://www.nltk.org/api/nltk.tag.html)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### HMM\n", "\n", "Voir [HMM](http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.hmm)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Clustering\n", "\n", "Une fois qu'on a des coordonn\u00e9es, on peut faire plein de choses."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### LDA\n", "\n", "* [Latent Dirichlet Application](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)\n", "* [LatentDirichletAllocation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": ["from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,\n", " max_features=1000)\n", "tfidf = tfidf_vectorizer.fit_transform(data[\"text\"])"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"text/plain": ["(5087, 1000)"]}, "execution_count": 24, "metadata": {}, "output_type": "execute_result"}], "source": ["tfidf.shape"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": ["from sklearn.decomposition import NMF, LatentDirichletAllocation\n", "lda = LatentDirichletAllocation(n_components=10, max_iter=5,\n", " learning_method='online',\n", " learning_offset=50.,\n", " random_state=0)"]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"data": {"text/plain": ["LatentDirichletAllocation(learning_method='online', learning_offset=50.0,\n", " max_iter=5, random_state=0)"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["lda.fit(tfidf)"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": 
["C:\\Python395_x64\\lib\\site-packages\\sklearn\\utils\\deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n", " warnings.warn(msg, category=FutureWarning)\n"]}, {"data": {"text/plain": ["['avoir', 'bac', 'bah']"]}, "execution_count": 27, "metadata": {}, "output_type": "execute_result"}], "source": ["tf_feature_names = tfidf_vectorizer.get_feature_names()\n", "tf_feature_names[100:103]"]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": ["def print_top_words(model, feature_names, n_top_words):\n", " for topic_idx, topic in enumerate(model.components_):\n", " print(\"Topic #%d:\" % topic_idx)\n", " print(\" \".join([feature_names[i]\n", " for i in topic.argsort()[- n_top_words - 1:][::-1]]))\n", " print()"]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Topic #0:\n", "gratuit mcdo supprimerai \u00e9cole soir kebab macdo kfc domicile cc volont\u00e9\n", "Topic #1:\n", "macron co https de la est le il et hollande un\n", "Topic #2:\n", "sijetaispresident je les de la et le des en pour que\n", "Topic #3:\n", "notaires eu organiserais mets carte nouveaux journ\u00e9es installation cache cr\u00e9er sijetaispresident\n", "Topic #4:\n", "sijetaispresident interdirais les je ballerines la serait serais bah de interdit\n", "Topic #5:\n", "ministre de sijetaispresident la je premier mort et nommerais pr\u00e9sident plus\n", "Topic #6:\n", "cours le supprimerais jour sijetaispresident lundi samedi semaine je vendredi dimanche\n", "Topic #7:\n", "port interdirait d\u00e9missionnerais promesses heure rendrai ballerine mes changement christineboutin tiendrais\n", "Topic #8:\n", "seraient sijetaispresident gratuits aux les nos putain \u00e9ducation nationale bonne aurais\n", "Topic #9:\n", "bordel seront l\u00e9galiserai putes 
gratuites pizza mot virerais vitesse dutreil vivre\n", "\n"]}], "source": ["print_top_words(lda, tf_feature_names, 10)"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0.02703569, 0.02703991, 0.75666556, 0.02703569, 0.02704012,\n", " 0.02703837, 0.02703696, 0.02703608, 0.02703592, 0.02703569],\n", " [0.02276328, 0.02277087, 0.79511841, 0.02276199, 0.02276289,\n", " 0.02276525, 0.02277065, 0.02276215, 0.02276251, 0.02276199],\n", " [0.02318042, 0.79137016, 0.02318268, 0.02318042, 0.02318137,\n", " 0.02318192, 0.0231807 , 0.02318045, 0.02318146, 0.02318042],\n", " [0.0294858 , 0.73460096, 0.02949239, 0.0294858 , 0.02949433,\n", " 0.0294906 , 0.0294873 , 0.02948597, 0.02948989, 0.02948696],\n", " [0.0260542 , 0.66003211, 0.02607499, 0.0260542 , 0.02605546,\n", " 0.13151004, 0.02605456, 0.0260542 , 0.02605602, 0.0260542 ]])"]}, "execution_count": 30, "metadata": {}, "output_type": "execute_result"}], "source": ["tr = lda.transform(tfidf)\n", "tr[:5]"]}, {"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [{"data": {"text/plain": ["(5087, 10)"]}, "execution_count": 31, "metadata": {}, "output_type": "execute_result"}], "source": ["tr.shape"]}, {"cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": ["import pyLDAvis\n", "import pyLDAvis.sklearn\n", "pyLDAvis.enable_notebook()"]}, {"cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["C:\\Python395_x64\\lib\\site-packages\\ipykernel\\ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. 
Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n", " and should_run_async(code)\n", "C:\\Python395_x64\\lib\\site-packages\\sklearn\\utils\\deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n", " warnings.warn(msg, category=FutureWarning)\n", "C:\\Python395_x64\\lib\\site-packages\\pyLDAvis\\_prepare.py:246: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.\n", " default_term_info = default_term_info.sort_values(\n"]}, {"data": {"text/html": ["\n", " \n", "\n", "\n", "
\n", ""], "text/plain": ["PreparedData(topic_coordinates= x y topics cluster Freq\n", "topic \n", "2 0.132172 0.049678 1 1 36.881857\n", "1 0.115237 0.158473 2 1 26.106389\n", "5 0.174221 0.095581 3 1 9.645160\n", "4 0.157026 -0.190649 4 1 5.727111\n", "6 -0.021095 -0.162058 5 1 4.590570\n", "8 0.005103 -0.062020 6 1 4.341137\n", "0 -0.171929 -0.012022 7 1 3.476527\n", "9 -0.157733 0.042830 8 1 3.247126\n", "7 -0.101223 0.019969 9 1 3.118181\n", "3 -0.131778 0.060219 10 1 2.865942, topic_info= Term Freq Total Category logprob loglift\n", "837 sijetaispresident 423.000000 423.000000 Default 30.0000 30.0000\n", "494 les 276.000000 276.000000 Default 29.0000 29.0000\n", "460 je 308.000000 308.000000 Default 28.0000 28.0000\n", "447 interdirais 58.000000 58.000000 Default 27.0000 27.0000\n", "397 gratuit 49.000000 49.000000 Default 26.0000 26.0000\n", ".. ... ... ... ... ... ...\n", "245 des 0.302704 139.492731 Topic10 -7.1798 -2.5807\n", "479 la 0.301500 246.705851 Topic10 -7.1838 -3.1549\n", "646 pas 0.301246 97.723601 Topic10 -7.1846 -2.2297\n", "321 et 0.300239 182.097554 Topic10 -7.1880 -2.8554\n", "277 du 0.298278 91.424869 Topic10 -7.1945 -2.1730\n", "\n", "[513 rows x 6 columns], token_table= Topic Freq Term\n", "term \n", "1 5 0.418647 0000\n", "5 5 0.860174 10h\n", "6 1 0.214435 11\n", "6 2 0.428870 11\n", "7 5 0.856212 12\n", "... ... ... 
...\n", "982 6 0.841685 \u00e9conomie\n", "983 2 0.801675 \u00e9conomique\n", "985 5 0.763411 \u00e9couter\n", "986 6 0.915375 \u00e9ducation\n", "987 1 0.583670 \u00e9lus\n", "\n", "[623 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[3, 2, 6, 5, 7, 9, 1, 10, 8, 4])"]}, "execution_count": 33, "metadata": {}, "output_type": "execute_result"}], "source": ["pyLDAvis.sklearn.prepare(lda, tfidf, tfidf_vectorizer)"]}, {"cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 4: LDA\n", "\n", "Start again after removing the stop words to get cleaner results."]}, {"cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5"}}, "nbformat": 4, "nbformat_minor": 2}