.. _td2asentimentanalysiscorrectionrst: ========================================== 2A.ml - Analyse de sentiments - correction ========================================== .. only:: html **Links:** :download:`notebook `, :downloadlink:`html `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/td2a_ml/td2a_sentiment_analysis_correction.ipynb|*` C’est désormais un problème classique de machine learning. D’un côté, du texte, de l’autre une appréciation, le plus souvent binaire, positive ou négative mais qui pourrait être graduelle. .. code:: ipython3 %matplotlib inline .. code:: ipython3 from jyquickhelper import add_notebook_menu add_notebook_menu() .. contents:: :local: Les données ----------- On récupère les données depuis le site UCI `Sentiment Labelled Sentences Data Set `__ où on utilise la fonction ``load_sentiment_dataset``. .. code:: ipython3 from ensae_teaching_cs.data import load_sentiment_dataset df = load_sentiment_dataset() df.head() .. raw:: html

	sentance	sentiment	source
0	So there is no way for me to plug it in here i...	0	amazon_cells_labelled
1	Good case, Excellent value.	1	amazon_cells_labelled
2	Great for the jawbone.	1	amazon_cells_labelled
3	Tied to charger for conversations lasting more...	0	amazon_cells_labelled
4	The mic is great.	1	amazon_cells_labelled

Exercice 1 : approche td-idf ---------------------------- La cible est la colonne *sentiment*, les deux autres colonnes sont les features. Il faudra utiliser les prétraitements `LabelEncoder `__, `OneHotEncoder `__, `TF-IDF `__. L’un d’entre eux n’est pas nécessaire depuis la version `0.20.0 `__ de *scikit-learn*. On s’occupe des variables catégorielles. La variable catégorielle ~~~~~~~~~~~~~~~~~~~~~~~~ Ce serait un peu plus simple avec le module `Category Encoders `__ ou la dernière nouveauté de scikit-learn : `ColumnTransformer `__. .. code:: ipython3 from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split( df.drop("sentiment", axis=1), df["sentiment"]) .. code:: ipython3 from sklearn.preprocessing import LabelEncoder, OneHotEncoder le = LabelEncoder() le.fit(X_train["source"]) X_le = le.transform(X_train["source"]) X_le.shape .. parsed-literal:: (2250,) .. code:: ipython3 X_le_mat = X_le.reshape((X_le.shape[0], 1)) .. code:: ipython3 ohe = OneHotEncoder(categories="auto") ohe.fit(X_le_mat) .. parsed-literal:: OneHotEncoder() .. code:: ipython3 X_le_encoded = ohe.transform(X_le_mat) train_cat = X_le_encoded.todense() test_cat = ohe.transform(le.transform(X_test["source"]).reshape((len(X_test), 1))).todense() .. code:: ipython3 import pandas X_train2 = pandas.concat([X_train.reset_index(drop=True), pandas.DataFrame(train_cat, columns=le.classes_)], sort=False, axis=1) X_train2.head(n=2) .. raw:: html

	sentance	source	amazon_cells_labelled	imdb_labelled	yelp_labelled
0	Now we were chosen to be tortured with this di...	imdb_labelled	0.0	1.0	0.0
1	Woa, talk about awful.	imdb_labelled	0.0	1.0	0.0

.. code:: ipython3 X_test2 = pandas.concat([X_test.reset_index(drop=True), pandas.DataFrame(test_cat, columns=le.classes_)], sort=False, axis=1) X_test2.head(n=2) .. raw:: html

	sentance	source	amazon_cells_labelled	imdb_labelled	yelp_labelled
0	It looks very nice.	amazon_cells_labelled	1.0	0.0	0.0
1	As a European, the movie is a nice throwback t...	imdb_labelled	0.0	1.0	0.0

tokenisation ~~~~~~~~~~~~ On tokenise avec le module `spacy `__ qui requiert des données supplémentaires pour découper en mot avec ``pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz`` selon les instructions dévoilées dans le `guide de départ `__ ou encore ``python -m spacy download en``. Le module `gensim `__ ne requiert pas d’installation. On peut aussi s’inspirer de l’example `word2vec pré-entraînés `__. .. code:: ipython3 import spacy nlp = spacy.load("en_core_web_sm") # Ca marche après avoir installé le corpus correspondant # python -m spacy download en_core_web_sm .. code:: ipython3 doc = nlp(X_train2.iloc[0,0]) [token.text for token in doc] .. parsed-literal:: ['Now', 'we', 'were', 'chosen', 'to', 'be', 'tortured', 'with', 'this', 'disgusting', 'piece', 'of', 'blatant', 'American', 'propaganda', '.', ' '] tf-idf ~~~~~~ Une fois que les mots sont tokenisé, on peut appliquer le *tf-idf*. .. code:: ipython3 from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer from sklearn.pipeline import make_pipeline tokenizer = lambda text: [token.text.lower() for token in nlp(text)] count = CountVectorizer(tokenizer=tokenizer, analyzer='word') tfidf = TfidfTransformer() pipe = make_pipeline(count, tfidf) .. code:: ipython3 pipe.fit(X_train["sentance"]) .. parsed-literal:: Pipeline(steps=[('countvectorizer', CountVectorizer(tokenizer= at 0x000001DCC8835488>)), ('tfidftransformer', TfidfTransformer())]) .. code:: ipython3 train_feature = pipe.transform(X_train2["sentance"]) train_feature .. parsed-literal:: <2250x4495 sparse matrix of type '' with 29554 stored elements in Compressed Sparse Row format> .. code:: ipython3 test_feature = pipe.transform(X_test2["sentance"]) Combinaison de toutes les variables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: ipython3 train_feature.shape, train_cat.shape .. parsed-literal:: ((2250, 4495), (2250, 3)) .. code:: ipython3 import numpy np_train = numpy.hstack([train_feature.todense(), train_cat]) np_test = numpy.hstack([test_feature.todense(), test_cat]) Calage d’un modèle ~~~~~~~~~~~~~~~~~~ .. code:: ipython3 from sklearn.ensemble import RandomForestClassifier rf = RandomForestClassifier(n_estimators=50) rf.fit(np_train, y_train) .. parsed-literal:: RandomForestClassifier(n_estimators=50) .. code:: ipython3 rf.score(np_test, y_test) .. parsed-literal:: 0.7533333333333333 Exercice 2 : word2vec --------------------- On utilise l’approche `word2vec `__ du module `gensim `__ ou `spacy `__. Avec `spacy `__, c’est assez simple : .. code:: ipython3 vv = nlp(X_train2.iloc[0, 0]) list(vv)[0].vector[:10], vv.vector.shape .. parsed-literal:: (array([-0.21269655, -0.7653725 , -0.1316224 , -0.3766306 , 0.5549566 , -0.60907495, 5.3928123 , 5.099738 , 4.210167 , 2.9974651 ], dtype=float32), (96,)) On fait la somme. .. code:: ipython3 sum([_.vector for _ in vv])[:10] .. parsed-literal:: array([-11.796999 , -8.17019 , 3.1232045, -14.440253 , 20.460987 , -8.738287 , 12.388309 , 23.718775 , -9.392727 , 1.9914403], dtype=float32) .. code:: ipython3 np_train_vect = numpy.zeros((X_train2.shape[0], vv.vector.shape[0])) for i, sentance in enumerate(X_train2["sentance"]): np_train_vect[i, :] = sum(v.vector for v in nlp(sentance.lower())) .. code:: ipython3 np_test_vect = numpy.zeros((X_test2.shape[0], vv.vector.shape[0])) for i, sentance in enumerate(X_test2["sentance"]): np_test_vect[i, :] = sum(v.vector for v in nlp(sentance.lower())) .. code:: ipython3 np_train_v = numpy.hstack([np_train_vect, train_cat]) np_test_v = numpy.hstack([np_test_vect, test_cat]) .. code:: ipython3 rfv = RandomForestClassifier(n_estimators=50) rfv.fit(np_train_v, y_train) .. parsed-literal:: RandomForestClassifier(n_estimators=50) .. code:: ipython3 rfv.score(np_test_v, y_test) .. parsed-literal:: 0.6146666666666667 Moins bien… Exercice 3 : comparer les deux approches ---------------------------------------- Avec une courbe `ROC `__ par exemple. .. code:: ipython3 pmodel1 = rf.predict_proba(np_test)[:, 1] pmodel2 = rfv.predict_proba(np_test_v)[:, 1] .. code:: ipython3 from sklearn.metrics import roc_auc_score, roc_curve, auc fpr1, tpr1, th1 = roc_curve(y_test, pmodel1) fpr2, tpr2, th2 = roc_curve(y_test, pmodel2) .. code:: ipython3 import matplotlib.pyplot as plt fig, ax = plt.subplots(1,1, figsize=(4,4)) ax.plot(fpr1, tpr1, label='tf-idf') ax.plot(fpr2, tpr2, label='word2vec') ax.legend(); .. image:: td2a_sentiment_analysis_correction_42_0.png Petite analyse d’erreurs ------------------------ On combine les erreurs des modèles sur la base de test. .. code:: ipython3 final = X_test.copy() final["model1"] = pmodel1 final["model2"] = pmodel2 final["label"] = y_test final.head() .. raw:: html

	sentance	source	model1	model2	label
850	It looks very nice.	amazon_cells_labelled	0.62	0.62	1
415	As a European, the movie is a nice throwback t...	imdb_labelled	0.68	0.54	1
585	Great food and great service in a clean and fr...	yelp_labelled	0.94	0.78	1
785	This allows the possibility of double booking ...	amazon_cells_labelled	0.48	0.54	0
440	Both do good jobs and are quite amusing.	imdb_labelled	0.64	0.46	1

On regarde des erreurs. .. code:: ipython3 erreurs = final[final["label"] == 1].sort_values("model2") erreurs.head() .. raw:: html

	sentance	source	model1	model2	label
527	Be sure to order dessert, even if you need to ...	yelp_labelled	0.54	0.20	1
707	This is cool because most cases are just open ...	amazon_cells_labelled	0.38	0.22	1
449	I won't say any more - I don't like spoilers, ...	imdb_labelled	0.20	0.22	1
676	I can't wait to go back.	yelp_labelled	0.20	0.22	1
908	I can hear while I'm driving in the car, and u...	amazon_cells_labelled	0.34	0.24	1

.. code:: ipython3 list(erreurs["sentance"])[:5] .. parsed-literal:: ['Be sure to order dessert, even if you need to pack it to-go - the tiramisu and cannoli are both to die for.', 'This is cool because most cases are just open there allowing the screen to get all scratched up.', "I won't say any more - I don't like spoilers, so I don't want to be one, but I believe this film is worth your time. ", "I can't wait to go back.", "I can hear while I'm driving in the car, and usually don't even have to put it on it's loudest setting."] Le modèle 2 reconnaît mal les négations visiblement. On regarde le modèle 1. .. code:: ipython3 erreurs = final[final["label"] == 1].sort_values("model1") erreurs.head() .. raw:: html

	sentance	source	model1	model2	label
436	The soundtrack wasn't terrible, either.	imdb_labelled	0.06	0.34	1
412	Not too screamy not to masculine but just righ...	imdb_labelled	0.08	0.26	1
161	I was seated immediately.	yelp_labelled	0.10	0.38	1
619	Don't miss it.	imdb_labelled	0.12	0.32	1
448	My 8/10 score is mostly for the plot.	imdb_labelled	0.14	0.48	1

.. code:: ipython3 list(erreurs["sentance"])[:5] .. parsed-literal:: ["The soundtrack wasn't terrible, either. ", 'Not too screamy not to masculine but just right. ', 'I was seated immediately.', "Don't miss it. ", 'My 8/10 score is mostly for the plot. '] Idem, voyons là où les modèles sont en désaccords. .. code:: ipython3 final["diff"] = final.model1 - final.model2 .. code:: ipython3 erreurs = final[final["label"] == 1].sort_values("diff") erreurs.head() .. raw:: html

	sentance	source	model1	model2	label	diff
390	If you want healthy authentic or ethic food, t...	yelp_labelled	0.30	0.72	1	-0.42
797	A good quality bargain.. I bought this after I...	amazon_cells_labelled	0.34	0.68	1	-0.34
53	This phone is pretty sturdy and I've never had...	amazon_cells_labelled	0.38	0.72	1	-0.34
691	Shot in the Southern California desert using h...	imdb_labelled	0.36	0.70	1	-0.34
448	My 8/10 score is mostly for the plot.	imdb_labelled	0.14	0.48	1	-0.34

.. code:: ipython3 erreurs.tail() .. raw:: html

	sentance	source	model1	model2	label	diff
464	The inside is really quite nice and very clean.	yelp_labelled	0.94	0.46	1	0.48
4	The mic is great.	amazon_cells_labelled	0.90	0.42	1	0.48
68	Great for iPODs too.	amazon_cells_labelled	0.96	0.46	1	0.50
341	It is a really good show to watch.	imdb_labelled	0.84	0.32	1	0.52
306	Has been working great.	amazon_cells_labelled	0.90	0.32	1	0.58

Le modèle 2 (word2vec) a l’air meilleur sur les phrases longues, le modèle 1 (tf-idf) saisit mieux les mots positifs. A confirmer sur plus de données. - Enlever les stop words, les signes de ponctuation. - Combiner les deux approches. - n-grammes - … Dernière analyse en regardant le taux d’erreur par source. .. code:: ipython3 r1 = rf.predict(np_test) r2 = rfv.predict(np_test_v) final["rep1"] = r1 final["rep2"] = r2 final["err1"] = (final.label - final.rep1).abs() final["err2"] = (final.label - final.rep2).abs() final["total"] = 1 final.head() .. raw:: html

	sentance	source	model1	model2	label	diff	rep1	rep2	err2	total
850	It looks very nice.	amazon_cells_labelled	0.62	0.62	1	0.00	1	1	0	1
415	As a European, the movie is a nice throwback t...	imdb_labelled	0.68	0.54	1	0.14	1	1	0	1
585	Great food and great service in a clean and fr...	yelp_labelled	0.94	0.78	1	0.16	1	1	0	1
785	This allows the possibility of double booking ...	amazon_cells_labelled	0.48	0.54	0	-0.06	0	1	1	1
440	Both do good jobs and are quite amusing.	imdb_labelled	0.64	0.46	1	0.18	1	0	1	1

.. code:: ipython3 final[["source", "err1", "err2", "total"]].groupby("source").sum() .. raw:: html

	err1	err2	total
source
amazon_cells_labelled	56	94	250
imdb_labelled	77	107	253
yelp_labelled	52	88	247

*imdb* paraît une source une peu plus difficile à saisir. Quoiqu’il en soit, 2000 phrases pour apprendre est assez peu pour apprendre. Versions utilisées pour ce notebook ----------------------------------- `spacy `__ s’est montré quelque peu fantasques cette année avec quelques erreurs notamment celle-ci : `ValueError: cymem.cymem.Pool has the wrong size, try recompiling `__. Voici les versions utilisées… .. code:: ipython3 def version(module, sub=True): try: ver = getattr(module, '__version__', None) if ver is None: ver = [_ for _ in os.listdir(os.path.join(module.__file__, '..', '..' if sub else '')) \ if module.__name__ in _ and 'dist' in _][-1] return ver except Exception as e: return str(e) .. code:: ipython3 import os import thinc print("thinc", version(thinc)) import preshed print("preshed", version(preshed)) import cymem print("cymem", version(cymem)) import murmurhash print("murmurhash", version(murmurhash)) import spacy print("spacy", spacy.__version__) import msgpack print("msgpack", version(msgpack)) import numpy print("numpy", numpy.__version__) .. parsed-literal:: thinc 7.4.1 preshed preshed-3.0.2.dist-info cymem cymem-2.0.2.dist-info murmurhash murmurhash-1.0.2.dist-info spacy 2.3.2 msgpack msgpack_numpy-0.4.4.3.dist-info numpy 1.18.1