2A.ml - Sentiment analysis - solution


This is now a classic machine learning problem: on one side some text, on the other a rating, most often binary (positive or negative), although it could be graded.

%matplotlib inline
from jyquickhelper import add_notebook_menu
add_notebook_menu()

The data

The data come from the UCI Sentiment Labelled Sentences Data Set; here they are loaded with the function load_sentiment_dataset.

from ensae_teaching_cs.data import load_sentiment_dataset
df = load_sentiment_dataset()
df.head()
sentance sentiment source
0 So there is no way for me to plug it in here i... 0 amazon_cells_labelled
1 Good case, Excellent value. 1 amazon_cells_labelled
2 Great for the jawbone. 1 amazon_cells_labelled
3 Tied to charger for conversations lasting more... 0 amazon_cells_labelled
4 The mic is great. 1 amazon_cells_labelled

Exercise 1: tf-idf approach

The target is the sentiment column; the other two columns are the features. The exercise calls for the preprocessing steps LabelEncoder, OneHotEncoder and TF-IDF. One of them is no longer necessary since scikit-learn 0.20.0 (LabelEncoder, because OneHotEncoder now accepts string categories directly). We start with the categorical variable.

The categorical variable

This would be somewhat simpler with the Category Encoders module or with a recent scikit-learn addition: ColumnTransformer.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                df.drop("sentiment", axis=1), df["sentiment"])
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
le.fit(X_train["source"])
X_le = le.transform(X_train["source"])
X_le.shape
(2250,)
X_le_mat = X_le.reshape((X_le.shape[0], 1))
ohe = OneHotEncoder(categories="auto")
ohe.fit(X_le_mat)
OneHotEncoder()
X_le_encoded = ohe.transform(X_le_mat)
train_cat = X_le_encoded.todense()
test_cat = ohe.transform(le.transform(X_test["source"]).reshape((len(X_test), 1))).todense()
import pandas
X_train2 = pandas.concat([X_train.reset_index(drop=True),
                          pandas.DataFrame(train_cat, columns=le.classes_)],
                         sort=False, axis=1)
X_train2.head(n=2)
sentance source amazon_cells_labelled imdb_labelled yelp_labelled
0 Now we were chosen to be tortured with this di... imdb_labelled 0.0 1.0 0.0
1 Woa, talk about awful. imdb_labelled 0.0 1.0 0.0
X_test2 = pandas.concat([X_test.reset_index(drop=True),
                         pandas.DataFrame(test_cat, columns=le.classes_)],
                         sort=False, axis=1)
X_test2.head(n=2)
sentance source amazon_cells_labelled imdb_labelled yelp_labelled
0 It looks very nice. amazon_cells_labelled 1.0 0.0 0.0
1 As a European, the movie is a nice throwback t... imdb_labelled 0.0 1.0 0.0
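
The ColumnTransformer mentioned at the start of this section can replace the manual LabelEncoder / OneHotEncoder / concat steps. The sketch below is only an illustration on made-up rows: the toy DataFrame, the transformer names "source" and "text", and the use of TfidfVectorizer with its default tokenizer (instead of spacy) are assumptions, not what the notebook does.

```python
import pandas
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy rows mimicking the notebook's columns (made up for illustration).
toy = pandas.DataFrame({
    "sentance": ["Good case, excellent value.", "The mic is great.",
                 "Woa, talk about awful.", "I was seated immediately."],
    "source": ["amazon_cells_labelled", "amazon_cells_labelled",
               "imdb_labelled", "yelp_labelled"],
})

ct = ColumnTransformer([
    # OneHotEncoder accepts string categories directly, no LabelEncoder needed.
    ("source", OneHotEncoder(handle_unknown="ignore"), ["source"]),
    # A text vectorizer expects a 1-D column, hence a plain string selector.
    ("text", TfidfVectorizer(), "sentance"),
])
features = ct.fit_transform(toy)

n_sources = len(ct.named_transformers_["source"].categories_[0])
n_words = len(ct.named_transformers_["text"].vocabulary_)
print(features.shape, n_sources, n_words)
```

The same fitted object can then be applied to the test set with ct.transform, which keeps the column order consistent between train and test.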

Tokenization

We tokenize with the spacy module, which needs additional data to split text into words. It can be installed with pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz, following the instructions in the getting-started guide, or with python -m spacy download en. The model version has to be compatible with the installed spacy version. The gensim module does not require such additional data. The example on pre-trained word2vec is another source of inspiration.

import spacy
nlp = spacy.load("en_core_web_sm")
# This works once the corresponding model has been installed:
# python -m spacy download en_core_web_sm
doc = nlp(X_train2.iloc[0,0])
[token.text for token in doc]
['Now',
 'we',
 'were',
 'chosen',
 'to',
 'be',
 'tortured',
 'with',
 'this',
 'disgusting',
 'piece',
 'of',
 'blatant',
 'American',
 'propaganda',
 '.',
 ' ']

tf-idf

Once the text is tokenized, tf-idf can be applied.

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import make_pipeline
tokenizer = lambda text: [token.text.lower() for token in nlp(text)]
count = CountVectorizer(tokenizer=tokenizer, analyzer='word')
tfidf = TfidfTransformer()
pipe = make_pipeline(count, tfidf)
pipe.fit(X_train["sentance"])
Pipeline(steps=[('countvectorizer',
                 CountVectorizer(tokenizer=<function <lambda> at 0x000001DCC8835488>)),
                ('tfidftransformer', TfidfTransformer())])
train_feature = pipe.transform(X_train2["sentance"])
train_feature
<2250x4495 sparse matrix of type '<class 'numpy.float64'>'
    with 29554 stored elements in Compressed Sparse Row format>
test_feature = pipe.transform(X_test2["sentance"])
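
To make the weighting concrete, here is a small numpy sketch of the computation that scikit-learn documents for TfidfTransformer with its default settings (smooth_idf=True, norm='l2'): idf = ln((1 + n) / (1 + df)) + 1, followed by an L2 normalisation of each row. The count matrix is made up for illustration.

```python
import numpy

# Term counts for 3 documents (rows) and 4 terms (columns); made-up values.
counts = numpy.array([[2, 1, 0, 1],
                      [0, 1, 1, 1],
                      [1, 0, 0, 1]], dtype=float)

n_docs = counts.shape[0]
# Number of documents in which each term appears.
doc_freq = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency.
idf = numpy.log((1 + n_docs) / (1 + doc_freq)) + 1
# Weight the raw counts by the idf of each term.
tfidf = counts * idf
# Normalise every document vector to unit Euclidean length.
tfidf /= numpy.linalg.norm(tfidf, axis=1, keepdims=True)

print(idf)
print(tfidf.round(3))
```

A term present in every document (the last column here) gets the smallest idf, so it weighs less than rarer terms with the same count.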

Combining all the features

train_feature.shape, train_cat.shape
((2250, 4495), (2250, 3))
import numpy
np_train = numpy.hstack([train_feature.todense(), train_cat])
np_test = numpy.hstack([test_feature.todense(), test_cat])
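
Converting the tf-idf matrix to dense works at this size but wastes memory, since most entries are zero. A possible alternative, shown here on random toy matrices rather than the notebook's data, is to stack the blocks with scipy.sparse.hstack; scikit-learn's RandomForestClassifier accepts sparse input.

```python
import numpy
from scipy import sparse

# Stand-in for the sparse tf-idf features (5 sentences, 8 words), made up.
text_part = sparse.random(5, 8, density=0.3, format="csr", random_state=0)
# Stand-in for the one-hot encoded source (3 categories), made up.
cat_part = numpy.array([[1., 0., 0.],
                        [0., 1., 0.],
                        [0., 0., 1.],
                        [1., 0., 0.],
                        [0., 1., 0.]])

# Stack the two blocks side by side without leaving the sparse format.
combined = sparse.hstack([text_part, sparse.csr_matrix(cat_part)], format="csr")

# Dense equivalent, built the way the notebook does, for comparison only.
dense = numpy.hstack([text_part.toarray(), cat_part])
print(combined.shape, combined.nnz)
```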

Fitting a model

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=50)
rf.fit(np_train, y_train)
RandomForestClassifier(n_estimators=50)
rf.score(np_test, y_test)
0.7533333333333333

Exercise 2: word2vec

We use the word2vec approach from the gensim or spacy module. With spacy, it is fairly simple:

vv = nlp(X_train2.iloc[0, 0])
list(vv)[0].vector[:10], vv.vector.shape
(array([-0.21269655, -0.7653725 , -0.1316224 , -0.3766306 ,  0.5549566 ,
        -0.60907495,  5.3928123 ,  5.099738  ,  4.210167  ,  2.9974651 ],
       dtype=float32),
 (96,))

We sum the token vectors.

sum([_.vector for _ in vv])[:10]
array([-11.796999 ,  -8.17019  ,   3.1232045, -14.440253 ,  20.460987 ,
        -8.738287 ,  12.388309 ,  23.718775 ,  -9.392727 ,   1.9914403],
      dtype=float32)
np_train_vect = numpy.zeros((X_train2.shape[0], vv.vector.shape[0]))
for i, sentance in enumerate(X_train2["sentance"]):
    np_train_vect[i, :] = sum(v.vector for v in nlp(sentance.lower()))
np_test_vect = numpy.zeros((X_test2.shape[0], vv.vector.shape[0]))
for i, sentance in enumerate(X_test2["sentance"]):
    np_test_vect[i, :] = sum(v.vector for v in nlp(sentance.lower()))
np_train_v = numpy.hstack([np_train_vect, train_cat])
np_test_v = numpy.hstack([np_test_vect, test_cat])
rfv = RandomForestClassifier(n_estimators=50)
rfv.fit(np_train_v, y_train)
RandomForestClassifier(n_estimators=50)
rfv.score(np_test_v, y_test)
0.6146666666666667

Not as good…
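
One thing worth keeping in mind about summing token vectors: the norm of the sum grows with the number of tokens, so sentence length leaks into the features. Averaging is one possible alternative (spacy documents Doc.vector as an average of the token vectors by default). The sketch below uses random vectors as hypothetical token embeddings and makes no claim that averaging improves the score here.

```python
import numpy

rng = numpy.random.default_rng(0)
# Hypothetical embeddings: a 3-token sentence in a 4-dimensional space.
short = rng.normal(size=(3, 4))
# A longer sentence made of the same tokens repeated five times.
long = numpy.vstack([short] * 5)

# The sum scales with the number of tokens...
sum_short, sum_long = short.sum(axis=0), long.sum(axis=0)
# ...whereas the mean does not depend on the length.
mean_short, mean_long = short.mean(axis=0), long.mean(axis=0)

print(numpy.linalg.norm(sum_short), numpy.linalg.norm(sum_long))
print(numpy.linalg.norm(mean_short), numpy.linalg.norm(mean_long))
```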

Exercise 3: compare the two approaches

With a ROC curve, for example.

pmodel1 = rf.predict_proba(np_test)[:, 1]
pmodel2 = rfv.predict_proba(np_test_v)[:, 1]
from sklearn.metrics import roc_auc_score, roc_curve, auc
fpr1, tpr1, th1 = roc_curve(y_test, pmodel1)
fpr2, tpr2, th2 = roc_curve(y_test, pmodel2)
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1, figsize=(4,4))
ax.plot(fpr1, tpr1, label='tf-idf')
ax.plot(fpr2, tpr2, label='word2vec')
ax.legend();
[Figure: ROC curves of the tf-idf and word2vec models on the test set]
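
The ROC curves can also be summarised by a single number, the area under the curve (AUC), using the roc_auc_score and auc functions imported above. The labels and scores below are a made-up toy example, not the notebook's predictions.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and predicted scores (made up for illustration).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Area computed from the points of the ROC curve...
fpr, tpr, thresholds = roc_curve(y_true, scores)
area_from_curve = auc(fpr, tpr)
# ...or directly from the labels and the scores.
area_direct = roc_auc_score(y_true, scores)

print(area_from_curve, area_direct)
```

Here three of the four (positive, negative) pairs are ranked correctly, hence an AUC of 0.75. Applied to the notebook, the calls would be roc_auc_score(y_test, pmodel1) and roc_auc_score(y_test, pmodel2).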

A short error analysis

We combine the models' errors on the test set.

final = X_test.copy()
final["model1"] = pmodel1
final["model2"] = pmodel2
final["label"] = y_test
final.head()
sentance source model1 model2 label
850 It looks very nice. amazon_cells_labelled 0.62 0.62 1
415 As a European, the movie is a nice throwback t... imdb_labelled 0.68 0.54 1
585 Great food and great service in a clean and fr... yelp_labelled 0.94 0.78 1
785 This allows the possibility of double booking ... amazon_cells_labelled 0.48 0.54 0
440 Both do good jobs and are quite amusing. imdb_labelled 0.64 0.46 1

Let's look at some errors.

erreurs = final[final["label"] == 1].sort_values("model2")
erreurs.head()
sentance source model1 model2 label
527 Be sure to order dessert, even if you need to ... yelp_labelled 0.54 0.20 1
707 This is cool because most cases are just open ... amazon_cells_labelled 0.38 0.22 1
449 I won't say any more - I don't like spoilers, ... imdb_labelled 0.20 0.22 1
676 I can't wait to go back. yelp_labelled 0.20 0.22 1
908 I can hear while I'm driving in the car, and u... amazon_cells_labelled 0.34 0.24 1
list(erreurs["sentance"])[:5]
['Be sure to order dessert, even if you need to pack it to-go - the tiramisu and cannoli are both to die for.',
 'This is cool because most cases are just open there allowing the screen to get all scratched up.',
 "I won't say any more - I don't like spoilers, so I don't want to be one, but I believe this film is worth your time.  ",
 "I can't wait to go back.",
 "I can hear while I'm driving in the car, and usually don't even have to put it on it's loudest setting."]

Model 2 apparently handles negations poorly. Let's look at model 1.

erreurs = final[final["label"] == 1].sort_values("model1")
erreurs.head()
sentance source model1 model2 label
436 The soundtrack wasn't terrible, either. imdb_labelled 0.06 0.34 1
412 Not too screamy not to masculine but just righ... imdb_labelled 0.08 0.26 1
161 I was seated immediately. yelp_labelled 0.10 0.38 1
619 Don't miss it. imdb_labelled 0.12 0.32 1
448 My 8/10 score is mostly for the plot. imdb_labelled 0.14 0.48 1
list(erreurs["sentance"])[:5]
["The soundtrack wasn't terrible, either.  ",
 'Not too screamy not to masculine but just right.  ',
 'I was seated immediately.',
 "Don't miss it.  ",
 'My 8/10 score is mostly for the plot.  ']

Same observation. Let's look at where the two models disagree.

final["diff"] = final.model1 - final.model2
erreurs = final[final["label"] == 1].sort_values("diff")
erreurs.head()
sentance source model1 model2 label diff
390 If you want healthy authentic or ethic food, t... yelp_labelled 0.30 0.72 1 -0.42
797 A good quality bargain.. I bought this after I... amazon_cells_labelled 0.34 0.68 1 -0.34
53 This phone is pretty sturdy and I've never had... amazon_cells_labelled 0.38 0.72 1 -0.34
691 Shot in the Southern California desert using h... imdb_labelled 0.36 0.70 1 -0.34
448 My 8/10 score is mostly for the plot. imdb_labelled 0.14 0.48 1 -0.34
erreurs.tail()
sentance source model1 model2 label diff
464 The inside is really quite nice and very clean. yelp_labelled 0.94 0.46 1 0.48
4 The mic is great. amazon_cells_labelled 0.90 0.42 1 0.48
68 Great for iPODs too. amazon_cells_labelled 0.96 0.46 1 0.50
341 It is a really good show to watch. imdb_labelled 0.84 0.32 1 0.52
306 Has been working great. amazon_cells_labelled 0.90 0.32 1 0.58

Model 2 (word2vec) seems better on long sentences, while model 1 (tf-idf) picks up positive words better. This remains to be confirmed on more data.

  • Remove stop words and punctuation.

  • Combine the two approaches.

  • Use n-grams.
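
As an illustration of the n-gram idea, the sketch below adds bigrams to a tf-idf vectorizer on two made-up sentences. It relies on scikit-learn's default tokenizer rather than spacy, which is an assumption made to keep the example self-contained. It also prints whether scikit-learn's built-in English stop word list contains "not", which is worth checking before removing stop words, given the negation errors seen above.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Two made-up sentences that differ only by a negation.
sentences = ["The soundtrack was not terrible.",
             "The soundtrack was terrible."]

# Unigrams and bigrams: "not terrible" becomes a feature of its own.
vec = TfidfVectorizer(ngram_range=(1, 2))
matrix = vec.fit_transform(sentences)
bigrams = sorted(term for term in vec.vocabulary_ if " " in term)
print(bigrams)

# Check whether a negation would be dropped with stop_words="english".
print("not" in ENGLISH_STOP_WORDS)
```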

One last analysis: the errors per source, next to the number of test sentences.

r1 = rf.predict(np_test)
r2 = rfv.predict(np_test_v)
final["rep1"] = r1
final["rep2"] = r2
final["err1"] = (final.label - final.rep1).abs()
final["err2"] = (final.label - final.rep2).abs()
final["total"] = 1
final.head()
sentance source model1 model2 label diff rep1 rep2 err1 err2 total
850 It looks very nice. amazon_cells_labelled 0.62 0.62 1 0.00 1 1 0 0 1
415 As a European, the movie is a nice throwback t... imdb_labelled 0.68 0.54 1 0.14 1 1 0 0 1
585 Great food and great service in a clean and fr... yelp_labelled 0.94 0.78 1 0.16 1 1 0 0 1
785 This allows the possibility of double booking ... amazon_cells_labelled 0.48 0.54 0 -0.06 0 1 0 1 1
440 Both do good jobs and are quite amusing. imdb_labelled 0.64 0.46 1 0.18 1 0 0 1 1
final[["source", "err1", "err2", "total"]].groupby("source").sum()
err1 err2 total
source
amazon_cells_labelled 56 94 250
imdb_labelled 77 107 253
yelp_labelled 52 88 247

imdb appears to be a slightly harder source. In any case, about 2,000 sentences is quite little to learn from.
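
Since the three sources do not have exactly the same number of test sentences, dividing the error counts by the totals makes the comparison easier. The sketch below rebuilds the table above by hand and computes the error rates.

```python
import pandas

# Error counts and totals copied from the table above.
errors = pandas.DataFrame(
    {"err1": [56, 77, 52], "err2": [94, 107, 88], "total": [250, 253, 247]},
    index=pandas.Index(["amazon_cells_labelled", "imdb_labelled",
                        "yelp_labelled"], name="source"))

# Error rate of each model per source.
rates = errors[["err1", "err2"]].div(errors["total"], axis=0)
print(rates.round(3))
```

In the notebook itself, the same rates could be obtained with final.groupby("source")[["err1", "err2"]].mean().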

Versions used for this notebook

spacy has been somewhat temperamental this year, with a few errors, notably this one: ValueError: cymem.cymem.Pool has the wrong size, try recompiling. Here are the versions used…

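import os
def version(module, sub=True):
    try:
        # Prefer the module's own __version__ attribute when it exists.
        ver = getattr(module, '__version__', None)
        if ver is None:
            # Otherwise fall back to the name of the matching dist-info folder.
            folder = os.path.join(module.__file__, '..', '..' if sub else '')
            ver = [_ for _ in os.listdir(folder)
                   if module.__name__ in _ and 'dist' in _][-1]
        return ver
    except Exception as e:
        return str(e)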
import thinc
print("thinc", version(thinc))
import preshed
print("preshed", version(preshed))
import cymem
print("cymem", version(cymem))
import murmurhash
print("murmurhash", version(murmurhash))
import spacy
print("spacy", spacy.__version__)

import msgpack
print("msgpack", version(msgpack))
import numpy
print("numpy", numpy.__version__)
thinc 7.4.1
preshed preshed-3.0.2.dist-info
cymem cymem-2.0.2.dist-info
murmurhash murmurhash-1.0.2.dist-info
spacy 2.3.2
msgpack msgpack_numpy-0.4.4.3.dist-info
numpy 1.18.1
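
On Python 3.8 and later, the standard library offers a more direct way to read installed versions than listing dist-info folders. The helper name installed_version below is hypothetical, and the sketch is an alternative to the version function above, not what the notebook used.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None

# Same packages as above; missing ones are reported as None.
for name in ["thinc", "preshed", "cymem", "murmurhash",
             "spacy", "msgpack", "numpy"]:
    print(name, installed_version(name))
```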