{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 1A.1 - Calculer un chi 2 sur un tableau de contingence\n", "\n", "$\\chi_2$ et tableau de contingence, avec *numpy*, avec *scipy* ou sans."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## formule\n", "\n", "Le test du $\\chi_2$ ([wikipedia](https://fr.wikipedia.org/wiki/Test_du_%CF%87%C2%B2)) sert \u00e0 comparer deux distributions. Il peut \u00eatre appliqu\u00e9 sur un [tableau de contingence](https://fr.wikipedia.org/wiki/Test_du_%CF%87%C2%B2#Test_du_.CF.87.C2.B2_d.27ind.C3.A9pendance) pour comparer la distributions observ\u00e9e avec la distribution qu'on observerait si les deux facteurs du tableau \u00e9taient ind\u00e9pendants. On note $M=(m_{ij})$ une matrice de dimension $I \\times J$. Le test du $\\chi_2$ se calcule comme suit :\n", "\n", "* $M = \\sum_{ij} m_{ij}$\n", "* $\\forall i, \\; m_{i \\bullet} = \\sum_j m_{ij}$\n", "* $\\forall j, \\; m_{\\bullet j} = \\sum_i m_{ij}$\n", "* $\\forall i,j \\; n_{ij} = \\frac{m_{i \\bullet} m_{\\bullet j}}{N}$\n", "\n", "Avec ces notations :\n", "\n", "$$T = \\sum_{ij} \\frac{ (m_{ij} - n_{ij})^2}{n_{ij}}$$\n", "\n", "La variable al\u00e9atoire $T$ suit asymptotiquement une loi du $\\chi_2$ \u00e0 $(I-1)(J-1)$ degr\u00e9s de libert\u00e9 ([table](http://www.apprendre-en-ligne.net/random/tablekhi2.html)). Comment le calculer avec [numpy](http://www.numpy.org/) ?"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## tableau au hasard\n", "\n", "On prend un petit tableau qu'on choisit au hasard, de pr\u00e9f\u00e9rence non carr\u00e9 pour d\u00e9tecter des erreurs de calculs."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 4, 5, 2, 1],\n", " [ 6, 3, 1, 7],\n", " [10, 14, 6, 9]])"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "M = numpy.array([[4, 5, 2, 1],\n", " [6, 3, 1, 7],\n", " [10, 14, 6, 9]])\n", "M"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## calcul avec scipy\n", "\n", "Evidemment, il existe une fonction en python qui permet de calculer la statistique $T$ : [chi2_contingency](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html)."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["(6.1685985038926212, 6, 0.40457120905808314)"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from scipy.stats import chi2_contingency\n", "chi2, pvalue, degrees, expected = chi2_contingency(M)\n", "chi2, degrees, pvalue"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## calcul avec numpy"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["(array([12, 17, 39]), array([20, 22, 9, 17]), 68)"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["N = M.sum()\n", "ni = numpy.array( [M[i,:].sum() for i in range(M.shape[0])] )\n", "nj = numpy.array( [M[:,j].sum() for j in range(M.shape[1])] )\n", "ni, nj, N"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Et comme c'est un usage courant, [numpy](http://www.numpy.org/) propose une fa\u00e7on de faire sans \u00e9crire une boucle avec la fonction [sum](https://docs.scipy.org/doc/numpy-1.11.0/reference/generated/numpy.sum.html) :"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["(array([12, 17, 39]), array([20, 22, 9, 17]), 68)"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["ni = M.sum(axis=1)\n", "nj = M.sum(axis=0)\n", "ni, nj, N"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 3.52941176, 3.88235294, 1.58823529, 3. ],\n", " [ 5. , 5.5 , 2.25 , 4.25 ],\n", " [ 11.47058824, 12.61764706, 5.16176471, 9.75 ]])"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["nij = ni.reshape(M.shape[0], 1) * nj / N\n", "nij"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["6.1685985038926212"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["d = (M - nij) ** 2 / nij\n", "d.sum()"]}, {"cell_type": "code", "execution_count": 8, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}