Coverage for src/ensae_teaching_cs/faq/faq

# -*- coding: utf-8 -*-
"""
@file
@brief Some recurring problems with `pandas <http://pandas.pydata.org/>`_.
"""

def read_csv(filepath_or_buffer, encoding="utf8", sep="\t", **args):
    """
    Calls function `read_csv <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html?highlight=read_csv#pandas.read_csv>`_
    with different default values. If the encoding is utf8 and the data is a file name, the function
    checks that there is no BOM at the beginning. If there is one, it reads the file again with encoding ``utf-8-sig``.

    @param      encoding                encoding
    @param      filepath_or_buffer      filepath_or_buffer
    @param      sep                     column separator
    @param      args                    additional parameters passed to ``pandas.read_csv``
    @return                             DataFrame

    .. faqref::
        :tag: pandas
        :title: Strange characters in utf8 on Windows (BOM)?

        .. index:: encoding, BOM, UTF8

        On Windows, some programs such as `Notepad <http://fr.wikipedia.org/wiki/Bloc-notes_%28Windows%29>`_
        can save a file under different `encodings <http://en.wikipedia.org/wiki/Character_encoding>`_.
        With the `UTF8 <http://fr.wikipedia.org/wiki/UTF-8>`_ encoding, the first character
        ``\\ufeff`` sometimes causes a problem because Notepad adds what is called a `BOM <http://fr.wikipedia.org/wiki/Indicateur_d%27ordre_des_octets>`_.
        For example::

            import pandas
            df = pandas.read_csv("dataframe.txt", sep="\\t", encoding="utf8")
            print(df)

        raises a most annoying error::

            UnicodeEncodeError: 'charmap' codec can't encode character '\\ufeff' in position 0: character maps to <undefined>

        To work around this, simply change the encoding to
        `utf-8-sig <https://docs.python.org/3/library/codecs.html#encodings-and-unicode>`_::

            import pandas
            df = pandas.read_csv("dataframe.txt", sep="\\t", encoding="utf-8-sig")
            print(df)
    """

    import pandas
    if isinstance(filepath_or_buffer, str):
        if encoding in ["utf8", "utf-8"]:
            try:
                df = pandas.read_csv(
                    filepath_or_buffer,
                    encoding=encoding,
                    sep=sep,
                    **args)
                if df.columns[0].startswith("\ufeff"):
                    raise UnicodeError(
                        "'charmap' codec can't encode characters in position 0-1325: character maps to <undefined>")
                return df
            except UnicodeDecodeError:
                df = pandas.read_csv(
                    filepath_or_buffer,
                    encoding="utf-8-sig",
                    sep=sep,
                    **args)
                return df
            except UnicodeError:
                df = pandas.read_csv(
                    filepath_or_buffer,
                    encoding="utf-8-sig",
                    sep=sep,
                    **args)
                return df
        else:
            return pandas.read_csv(
                filepath_or_buffer, encoding=encoding, sep=sep, **args)
    else:
        return pandas.read_csv(
            filepath_or_buffer, encoding=encoding, sep=sep, **args)
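
# The BOM behaviour described in the docstring of read_csv can be observed
# with the standard library alone. The sketch below is only an illustration:
# the temporary file and its content are hypothetical, and the file is written
# with "utf-8-sig" to mimic what Notepad is described as doing.

```python
import os
import tempfile

# Hypothetical sample file: the "utf-8-sig" codec writes a BOM first.
path = os.path.join(tempfile.mkdtemp(), "dataframe.txt")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("name\tvalue\nx\t1\n")

# Decoded as plain utf-8, the BOM is kept as the first character.
with open(path, "r", encoding="utf-8") as f:
    raw = f.read()

# Decoded as utf-8-sig, the BOM is skipped when present.
with open(path, "r", encoding="utf-8-sig") as f:
    clean = f.read()

print(repr(raw[:5]), repr(clean[:4]))
```

# Reading with "utf-8-sig" also works on a file without a BOM, which is
# presumably why the function can fall back on it safely.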

def df_to_clipboard(df, **args):
    """
    Copies a dataframe as *csv* text into the clipboard.

    @param      df      dataframe
    @param      args    additional parameters, such as *sep*;
                        the separator *sep* defaults to ``\\t``
                        for this function unless
                        it is specified otherwise

    It relies on method :epkg:`to_clipboard`.

    .. faqref::
        :title: Copy a dataframe to the clipboard
        :tag: pandas

        To get a dataframe into Excel, one can use method
        `to_excel <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html>`_
        and then open the file in Excel, or copy the dataframe to the clipboard and paste it
        into a sheet open in Excel. That is the purpose of method
        :epkg:`to_clipboard`::

            df = pandas.DataFrame ( ... )
            df.to_clipboard(sep="\\t")
    """
    if "sep" in args:
        df.to_clipboard(**args)
    else:
        df.to_clipboard(sep="\t", **args)


def df_equal(df1, df2):
    """
    Compares two dataframes and tells if they are equal.

    @param      df1     first dataframe
    @param      df2     second dataframe
    @return             boolean

    The function compares the columns one by one.
    It does not check that the columns are in the same order:
    it reorders the columns before doing the comparison.

    If you need a more complex comparison,
    you can look into function :epkg:`assert_frame_equal`.

    The function does not handle NaN values well because ``numpy.nan != numpy.nan`` is true.
    It also compares types.

    .. faqref::
        :tag: pandas
        :title: How to compare two dataframes?

        Writing ``df1 == df2`` does not tell whether two dataframes are equal,
        because equality does not necessarily mean the same thing to everyone.
        Even if the values are the same, does the order of the columns
        matter?
        You have to write the comparison yourself to fit
        your needs. The code below
        first compares the dimensions, then the column names
        regardless of their order, and finally the values::

            if df1.shape != df2.shape:
                return False
            l1 = list(df1.columns)
            l2 = list(df2.columns)
            l1.sort()
            l2.sort()
            if l1 != l2:
                return False
            df1 = df1[l1]
            df2 = df2[l2]
            t = (df1 == df2).all()
            s = set(t)
            return False not in s

        Other alternatives:

        * `equals <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html>`_
        * :epkg:`assert_frame_equal`
    """

    if df1.shape != df2.shape:
        return False
    l1 = list(df1.columns)
    l2 = list(df2.columns)
    l1.sort()
    l2.sort()
    if l1 != l2:
        return False
    df1 = df1[l1]
    df2 = df2[l2]
    s = set((df1.dtypes == df2.dtypes))
    if False in s:
        return False
    s = set((df1 == df2).all())
    return False not in s
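
# The NaN caveat mentioned in the docstring of df_equal can be illustrated
# as follows. This is a minimal sketch on a made-up dataframe: it contrasts
# the element-wise comparison used above with DataFrame.equals, which treats
# NaN values at the same location as equal.

```python
import numpy
import pandas

# Hypothetical data with one missing value.
df1 = pandas.DataFrame({"a": [1.0, numpy.nan], "b": ["x", "y"]})
df2 = df1.copy()

# Element-wise comparison: NaN is never equal to NaN, so column "a" fails.
elementwise = bool((df1 == df2).all().all())

# DataFrame.equals considers NaN values in the same position as equal.
with_equals = df1.equals(df2)

print(elementwise, with_equals)
```

# On identical dataframes containing NaN, the element-wise test reports a
# difference while equals does not.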


def groupby_topn(df, by_keys, sort_keys, ascending=True, n=1, as_index=True):
    """
    Takes the top *n* rows per group.

    @param      df          dataframe
    @param      by_keys     rows will be grouped by these columns
    @param      sort_keys   rows will be sorted by these columns
    @param      ascending   parameter passed to ``sort_values``
    @param      n           n in top *n*
    @param      as_index    if False, remove the index after the group by
    @return                 result

    .. faqref::
        :tag: pandas
        :title: top n rows with pandas

        Grouping rows and then keeping the first observations of each group is a
        classic problem. There is no single best way to do it;
        it depends on the number of observations per group. The simplest way
        to do it with pandas is to:

        * group the rows
        * sort the rows within each group
        * keep the first rows of each group

        This gives::

            (df.groupby(by_keys)
               .apply(lambda x: x.sort_values(sort_keys, ascending=ascending).head(n))
               .reset_index(drop=True))

        The last instruction drops the index, which gives the final dataframe
        the same structure as the initial dataframe.

        .. runpython::
            :showcode:

            import pandas
            l = [dict(k1="a", k2="b", v=4, i=1),
                 dict(k1="a", k2="b", v=5, i=1),
                 dict(k1="a", k2="b", v=4, i=2),
                 dict(k1="b", k2="b", v=1, i=2),
                 dict(k1="b", k2="b", v=1, i=3)]
            df = pandas.DataFrame(l)
            res = df.groupby(["k1", "k2"]).apply(lambda x: x.sort_values(["v", "i"], ascending=True).head(1))
            print(res)
    """

    res = df.groupby(by_keys).apply(lambda x: x.sort_values(
        sort_keys, ascending=ascending).head(n))
    if not as_index:
        res = res.reset_index(drop=True)
    return res
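
# The docstring of groupby_topn notes that there is no single best way to
# take the top rows per group. One alternative, shown here only as a sketch
# and not as the method used by groupby_topn, sorts the whole dataframe first
# and then keeps the first rows of each group with GroupBy.head. The data is
# the same as in the runpython example above.

```python
import pandas

rows = [dict(k1="a", k2="b", v=4, i=1),
        dict(k1="a", k2="b", v=5, i=1),
        dict(k1="a", k2="b", v=4, i=2),
        dict(k1="b", k2="b", v=1, i=2),
        dict(k1="b", k2="b", v=1, i=3)]
df = pandas.DataFrame(rows)

# Sort once, then keep the first row of every (k1, k2) group.
# The original row index is preserved; no group index is added.
top = df.sort_values(["v", "i"], ascending=True).groupby(["k1", "k2"]).head(1)
print(top)
```

# Which variant is faster depends on the data; this sketch makes no claim
# about performance.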


def speed_dataframe():
    """
    .. faqref::
        :tag: pandas
        :title: How to create a dataframe quickly?

        The notebook :ref:`dataframematrixspeedrst` compares different ways
        to create a `dataframe <http://pandas-docs.github.io/pandas-docs-travis/enhancingperf.html?highlight=dataframe>`_
        or an `array <http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html>`_.
        Some lessons:

        * Even if the data is produced by a generator, pandas converts it into a list.
        * Creating an array is faster from a generator than from a list.
    """
    pass
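
# The FAQ entry above mentions building an array from a generator. The
# sketch below shows one way to do so, numpy.fromiter, next to the list-based
# construction. Whether the notebook relies on numpy.fromiter is an
# assumption, and the generator "squares" is a hypothetical example; no
# timing is measured here.

```python
import numpy


def squares(n):
    # Hypothetical generator producing n float values.
    for i in range(n):
        yield float(i * i)


# From a list: the generator is fully materialized first.
from_list = numpy.array(list(squares(5)))

# From the generator directly: numpy.fromiter consumes the iterator;
# count lets numpy allocate the output array once.
from_gen = numpy.fromiter(squares(5), dtype=float, count=5)

print(from_list, from_gen)
```

# Both constructions produce the same one-dimensional array of floats.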

Coverage for src/ensae_teaching_cs/faq/faq_pandas.py : 71%


45 statements 32 run 13 missing 0 excluded
