{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.eco - Python et la logique SQL - correction\n", "\n", "Correction d'exercices sur SQL."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["SQL permet de cr\u00e9er des tables, de rechercher, d'ajouter, de modifier ou de supprimer des donn\u00e9es dans les bases de donn\u00e9es. \n", "Un peu ce que vous ferez bient\u00f4t tous les jours. C\u2019est un langage de management de donn\u00e9es, pas de nettoyage, d\u2019analyse ou de statistiques avanc\u00e9es.\n", "\n", "Les instructions SQL s'\u00e9crivent d'une mani\u00e8re qui ressemble \u00e0 celle de phrases ordinaires en anglais. Cette ressemblance voulue vise \u00e0 faciliter l'apprentissage et la lecture. Il est n\u00e9anmoins important de respecter un ordre pour les diff\u00e9rentes instructions.\n", "\n", "Dans ce TD, nous allons \u00e9crire des commandes en SQL via Python.\n", "\n", "Pour plus de pr\u00e9cisions sur SQL et les commandes qui existent, rendez-vous l\u00e0 [SQL, PRINCIPES DE BASE](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/ext2a/sql_doc.html)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Se connecter \u00e0 une base de donn\u00e9es\n", "\n", "A la diff\u00e9rence des tables qu'on utilise habituellement, la base de donn\u00e9es n'est pas visible directement en ouvrant Excel ou un \u00e9diteur de texte. Pour avoir une vue de ce que contient la base de donn\u00e9es, il est n\u00e9cessaire d'avoir un autre type de logiciel.\n", "\n", "Pour le TD, nous vous recommandans d'installer SQLLiteSpy (disponible \u00e0 cette adresse [SqliteSpy](http://www.yunqa.de/delphi/products/sqlitespy/index) ou [sqlite_bro](https://pypi.python.org/pypi/sqlite_bro) si vous voulez voir \u00e0 quoi ressemble les donn\u00e9es avant de les utiliser avec Python."]}, {"cell_type": "code", "execution_count": 2, "metadata": {"collapsed": true}, "outputs": [], "source": ["import sqlite3\n", "# on va se connecter \u00e0 une base de donn\u00e9es SQL vide\n", "# SQLite stocke la BDD dans un simple fichier\n", "filepath = \"./DataBase.db\"\n", "open(filepath, 'w').close() #cr\u00e9e un fichier vide\n", "CreateDataBase = sqlite3.connect(filepath)\n", "\n", "QueryCurs = CreateDataBase.cursor()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La m\u00e9thode cursor() est un peu particuli\u00e8re : \n", "\n", "Il s'agit d'une sorte de tampon m\u00e9moire interm\u00e9diaire, destin\u00e9 \u00e0 m\u00e9moriser temporairement les donn\u00e9es en cours de traitement, ainsi que les op\u00e9rations que vous effectuez sur elles, avant leur transfert d\u00e9finitif dans la base de donn\u00e9es. Tant que la m\u00e9thode .commit() n'aura pas \u00e9t\u00e9 appel\u00e9e, aucun ordre ne sera appliqu\u00e9 \u00e0 la base de donn\u00e9es."]}, {"cell_type": "markdown", "metadata": {}, "source": ["--------------------\n", "A pr\u00e9sent que nous sommes connect\u00e9s \u00e0 la base de donn\u00e9es, on va cr\u00e9er une table qui contient plusieurs variables de format diff\u00e9rents\n", "- ID sera la cl\u00e9 primaire de la base\n", "- Nom, Rue, Ville, Pays seront du text\n", "- Prix sera un r\u00e9el"]}, {"cell_type": "code", "execution_count": 3, "metadata": {"collapsed": true}, "outputs": [], "source": ["# On d\u00e9finit une fonction de cr\u00e9ation de table\n", "def CreateTable(nom_bdd):\n", " QueryCurs.execute('''CREATE TABLE IF NOT EXISTS ''' + nom_bdd + '''\n", " (id INTEGER PRIMARY KEY, Name TEXT,City TEXT, Country TEXT, Price REAL)''')\n", "\n", "# On d\u00e9finit une fonction qui permet d'ajouter des observations dans la table \n", "def AddEntry(nom_bdd, Nom,Ville,Pays,Prix):\n", " QueryCurs.execute('''INSERT INTO ''' + nom_bdd + ''' \n", " (Name,City,Country,Price) VALUES (?,?,?,?)''',(Nom,Ville,Pays,Prix))\n", " \n", "def AddEntries(nom_bdd, data):\n", " \"\"\" data : list with (Name,City,Country,Price) tuples to insert\n", " \"\"\"\n", " QueryCurs.executemany('''INSERT INTO ''' + nom_bdd + ''' \n", " (Name,City,Country,Price) VALUES (?,?,?,?)''',data)\n", " \n", " \n", "### On va cr\u00e9er la table clients\n", "\n", "CreateTable('Clients')\n", "\n", "AddEntry('Clients','Toto','Munich','Germany',5.2)\n", "AddEntries('Clients',\n", " [('Bill','Berlin','Germany',2.3),\n", " ('Tom','Paris','France',7.8),\n", " ('Marvin','Miami','USA',15.2),\n", " ('Anna','Paris','USA',7.8)])\n", "\n", "# on va \"commit\" c'est \u00e0 dire qu'on va valider la transaction. \n", "# > on va envoyer ses modifications locales vers le r\u00e9f\u00e9rentiel central - la base de donn\u00e9es SQL\n", "\n", "CreateDataBase.commit()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Voir la table\n", "Pour voir ce qu'il y a dans la table, on utilise un premier Select o\u00f9 on demande \u00e0 voir toute la table"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[(1, 'Toto', 'Munich', 'Germany', 5.2), (2, 'Bill', 'Berlin', 'Germany', 2.3), (3, 'Tom', 'Paris', 'France', 7.8), (4, 'Marvin', 'Miami', 'USA', 15.2), (5, 'Anna', 'Paris', 'USA', 7.8)]\n"]}], "source": ["QueryCurs.execute('SELECT * FROM Clients')\n", "Values = QueryCurs.fetchall()\n", "print(Values)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Passer en pandas\n", "\n", "Rien de plus simple : plusieurs mani\u00e8res de faire"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["En utilisant la m\u00e9thode read_sql_query \n", " id Name City Country Price\n", "0 1 Toto Munich Germany 5.2\n", "1 2 Bill Berlin Germany 2.3\n", "2 3 Tom Paris France 7.8\n", "3 4 Marvin Miami USA 15.2\n", "4 5 Anna Paris USA 7.8 \n", "\n", "En passant par une DataFrame \n", " ID Name City Country Price\n", "0 1 Toto Munich Germany 5.2\n", "1 2 Bill Berlin Germany 2.3\n", "2 3 Tom Paris France 7.8\n", "3 4 Marvin Miami USA 15.2\n", "4 5 Anna Paris USA 7.8\n"]}], "source": ["import pandas as pd\n", "# m\u00e9thode SQL Query\n", "df1 = pd.read_sql_query('SELECT * FROM Clients', CreateDataBase)\n", "print(\"En utilisant la m\u00e9thode read_sql_query \\n\", df1.head(), \"\\n\")\n", "\n", "\n", "#m\u00e9thode DataFrame en utilisant la liste issue de .fetchall()\n", "df2 = pd.DataFrame(Values, columns=['ID','Name','City','Country','Price'])\n", "print(\"En passant par une DataFrame \\n\", df2.head())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Comparaison SQL et pandas\n", "### SELECT\n", "\n", "En SQL, la s\u00e9lection se fait en utilisant des virgules ou * si on veut s\u00e9lectionner toutes les colonnes"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[(1, 'Munich'), (2, 'Berlin')]\n"]}], "source": ["# en SQL\n", "QueryCurs.execute('SELECT ID,City FROM Clients LIMIT 2')\n", "Values = QueryCurs.fetchall()\n", "print(Values)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["En pandas, la s\u00e9lection de colonnes se fait en donnant une liste"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDCity
01Munich
12Berlin
\n", "
"], "text/plain": [" ID City\n", "0 1 Munich\n", "1 2 Berlin"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["#sur la table\n", "df2[['ID','City']].head(2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### WHERE\n", "En SQL, on utilise WHERE pour filtrer les tables selon certaines conditions"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[(3, 'Tom', 'Paris', 'France', 7.8), (5, 'Anna', 'Paris', 'USA', 7.8)]\n"]}], "source": ["QueryCurs.execute('SELECT * FROM Clients WHERE City==\"Paris\"')\n", "print(QueryCurs.fetchall())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Avec Pandas, on peut utiliser plusieurs mani\u00e8res de faire : \n", " - avec un bool\u00e9en\n", " - en utilisant la m\u00e9thode 'query'"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDNameCityCountryPrice
23TomParisFrance7.8
45AnnaParisUSA7.8
\n", "
"], "text/plain": [" ID Name City Country Price\n", "2 3 Tom Paris France 7.8\n", "4 5 Anna Paris USA 7.8"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["df2[df2['City'] == \"Paris\"]"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDNameCityCountryPrice
23TomParisFrance7.8
45AnnaParisUSA7.8
\n", "
"], "text/plain": [" ID Name City Country Price\n", "2 3 Tom Paris France 7.8\n", "4 5 Anna Paris USA 7.8"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["df2.query('City == \"Paris\"')"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour mettre plusieurs conditions, on utilise : \n", "- & en Python, AND en SQL\n", "- | en python, OR en SQL"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[(5, 'Anna', 'Paris', 'USA', 7.8)]\n"]}], "source": ["QueryCurs.execute('SELECT * FROM Clients WHERE City==\"Paris\" AND Country == \"USA\"')\n", "print(QueryCurs.fetchall())"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDNameCityCountryPrice
45AnnaParisUSA7.8
\n", "
"], "text/plain": [" ID Name City Country Price\n", "4 5 Anna Paris USA 7.8"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["df2.query('City == \"Paris\" & Country == \"USA\"')\n", "\n", "df2[(df2['City'] == \"Paris\") & (df2['Country'] == \"USA\")]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## GROUP BY\n", "\n", "En pandas, l'op\u00e9ration GROUP BY de SQL s'effectue avec une m\u00e9thode similaire : groupby() \n", "\n", "groupby() sert \u00e0 regrouper des observations en groupes selon les modalit\u00e9s de certaines variables en appliquant une fonction d'aggr\u00e9gation sur d'autres variables."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[('France', 1), ('Germany', 2), ('USA', 2)]\n"]}], "source": ["QueryCurs.execute('SELECT Country, count(*) FROM Clients GROUP BY Country')\n", "print(QueryCurs.fetchall())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Attention, en pandas, la fonction count() ne fait pas la m\u00eame chose qu'en SQL. Count() s'applique \u00e0 toutes les colonnes et compte toutes les observations non nulles."]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDNameCityPrice
Country
France1111
Germany2222
USA2222
\n", "
"], "text/plain": [" ID Name City Price\n", "Country \n", "France 1 1 1 1\n", "Germany 2 2 2 2\n", "USA 2 2 2 2"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["df2.groupby('Country').count()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour r\u00e9aliser la m\u00eame chose qu'en SQL, il faut utiliser la m\u00e9thode size()"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["Country\n", "France 1\n", "Germany 2\n", "USA 2\n", "dtype: int64"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["df2.groupby('Country').size()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut aussi appliquer des fonctions plus sophistiqu\u00e9es lors d'un groupby"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[('France', 7.8, 1), ('Germany', 3.75, 2), ('USA', 11.5, 2)]\n"]}], "source": ["QueryCurs.execute('SELECT Country, AVG(Price), count(*) FROM Clients GROUP BY Country')\n", "print(QueryCurs.fetchall())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Avec pandas, on peut appeler les fonctions classiques de numpy"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PriceCountry
Country
France7.801
Germany3.752
USA11.502
\n", "
"], "text/plain": [" Price Country\n", "Country \n", "France 7.80 1\n", "Germany 3.75 2\n", "USA 11.50 2"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy as np\n", "df2.groupby('Country').agg({'Price': np.mean, 'Country': np.size})"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Ou utiliser des fonctions lambda"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["Country\n", "France 15.6\n", "Germany 7.5\n", "USA 23.0\n", "Name: Price, dtype: float64"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["# par exemple calculer le prix moyen et le multiplier par 2\n", "df2.groupby('Country')['Price'].apply(lambda x: 2*x.mean())"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('France', 15.6), ('Germany', 7.5), ('USA', 23.0)]"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["QueryCurs.execute('SELECT Country, 2*AVG(Price) FROM Clients GROUP BY Country').fetchall()"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[(1, 'Toto', 'Munich', 'Germany', 5.2), (2, 'Bill', 'Berlin', 'Germany', 2.3)]\n", "[(2, 'Bill', 'Berlin', 'Germany', 2.3)]\n", "[(3, 'Tom', 'Paris', 'France', 7.8), (4, 'Marvin', 'Miami', 'USA', 15.2), (5, 'Anna', 'Paris', 'USA', 7.8)]\n"]}], "source": ["QueryCurs.execute('SELECT * FROM Clients WHERE Country == \"Germany\"')\n", "print(QueryCurs.fetchall())\n", "QueryCurs.execute('SELECT * FROM Clients WHERE City==\"Berlin\" AND Country == \"Germany\"')\n", "print(QueryCurs.fetchall())\n", "QueryCurs.execute('SELECT * FROM Clients WHERE Price BETWEEN 7 AND 20')\n", "print(QueryCurs.fetchall())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Enregistrer une table SQL sous un autre format \n", "\n", "On utilise le package csv, l'option 'w' pour 'write'. \n", "\n", "On cr\u00e9e l'objet \"writer\", qui vient du package csv.\n", "\n", "Cet objet a deux m\u00e9thodes : \n", "- writerow pour les noms de colonnes : une liste\n", "- writerows pour les lignes : un ensemble de liste\n"]}, {"cell_type": "code", "execution_count": 21, "metadata": {"collapsed": true}, "outputs": [], "source": ["data = QueryCurs.execute('SELECT * FROM Clients')\n", "\n", "import csv\n", "\n", "with open('./output.csv', 'w') as file:\n", " writer = csv.writer(file)\n", " writer.writerow(['id','Name','City','Country','Price'])\n", " writer.writerows(data)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut \u00e9galement passer par un DataFrame pandas et utiliser .to_csv()"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/plain": [""]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["QueryCurs.execute('''DROP TABLE Clients''')\n", "#QueryCurs.close()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Dans cet exercice, nous allons manipuler les tables de la base de donn\u00e9es World. \n", "\n", "Avant tout, connectez vous \u00e0 la base de don\u00e9nes en utilisant sqlite3 et connect\n", "\n", "Lien vers la base de donn\u00e9es : [World.db3](https://github.com/sdpython/ensae_teaching_cs/raw/master/src/ensae_teaching_cs/data/data_sql/World.db3) ou \n", "\n", "```\n", "from ensae_teaching_cs.data import simple_database\n", "name = simple_database()\n", "```"]}, {"cell_type": "code", "execution_count": 23, "metadata": {"collapsed": true}, "outputs": [], "source": ["#Se connecter \u00e0 la base de donn\u00e9es WORLD\n", "CreateDataBase = sqlite3.connect(\"./World.db3\")\n", "QueryCurs = CreateDataBase.cursor()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Familiarisez vous avec la base de donn\u00e9es : quelles sont les tables ? quelles sont les variables de ces tables ? \n", " - utilisez la fonction PRAGMA pour obtenir des informations sur les tables\n", " "]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Table : City\n", "Colonnes ['ID', 'Name', 'CountryCode', 'District', 'Population']\n", "1\u00e8re ligne [(1, 'Kabul', 'AFG', 'Kabol', 1780000)] \n", "\n", "Table : Country\n", "Colonnes ['Code', 'Name', 'Continent', 'Region', 'SurfaceArea', 'IndepYear', 'Population', 'LifeExpectancy', 'GNP', 'GNPOld', 'LocalName', 'GovernmentForm', 'HeadOfState', 'Capital', 'Code2']\n", "1\u00e8re ligne [('AFG', 'Afghanistan', 'Asia', 'Southern and Central Asia', 652090.0, 1919, 22720000, 45.9, 5976.0, None, 'Afganistan/Afqanestan', 'Islamic Emirate', 'Mohammad Omar', 1, 'AF')] \n", "\n", "Table : CountryLanguage\n", "Colonnes ['CountryCode', 'Language', 'IsOfficial', 'Percentage']\n", "1\u00e8re ligne [('AFG', 'Pashto', 1, 52.4)] \n", "\n"]}], "source": ["# pour obtenir la liste des tables dans la base de donn\u00e9es\n", "tables = QueryCurs.execute(\"SELECT name FROM sqlite_master WHERE type='table';\").fetchall()\n", "\n", "# on veut voir les colonnes de chaque table ainsi que la premi\u00e8re ligne \n", "for table in tables : \n", " print(\"Table :\", table[0])\n", " schema = QueryCurs.execute(\"PRAGMA table_info({})\".format(table[0])).fetchall()\n", " print(\"Colonnes\", [\"{}\".format(x[1]) for x in schema])\n", " print(\"1\u00e8re ligne\", QueryCurs.execute('SELECT * FROM {} LIMIT 1'.format(table[0])).fetchall(), \"\\n\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Question 1\n", "- Quels sont les 10 pays qui ont le plus de langues ?\n", "- Quelle langue est pr\u00e9sente dans le plus de pays ?"]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('AGO', 13),\n", " ('CAN', 12),\n", " ('CHN', 12),\n", " ('IND', 12),\n", " ('RUS', 12),\n", " ('USA', 12),\n", " ('TZA', 11),\n", " ('ZAF', 11),\n", " ('COD', 10),\n", " ('IRN', 10)]"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["QueryCurs.execute(\"\"\"SELECT CountryCode, COUNT(*) as NB \n", " FROM CountryLanguage \n", " GROUP BY CountryCode \n", " ORDER BY NB DESC\n", " LIMIT 10\"\"\").fetchall()"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('English', 60)]"]}, "execution_count": 27, "metadata": {}, "output_type": "execute_result"}], "source": ["QueryCurs.execute('''SELECT Language, COUNT(*) as NB \n", " FROM CountryLanguage \n", " GROUP BY Language \n", " ORDER BY -NB\n", " LIMIT 1''').fetchall()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Question 2\n", "- Quelles sont les diff\u00e9rentes formes de gouvernements dans les pays du monde ?\n", "- Quels sont les 3 gouvernements o\u00f9 la population est la plus importante ?"]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('Islamic Emirate',),\n", " ('Constitutional Monarchy',),\n", " ('Nonmetropolitan Territory of The Netherlands',),\n", " ('Republic',),\n", " ('US Territory',),\n", " ('Parliamentary Coprincipality',),\n", " ('Dependent Territory of the UK',),\n", " ('Emirate Federation',),\n", " ('Federal Republic',),\n", " ('Constitutional Monarchy, Federation',),\n", " ('Monarchy (Emirate)',),\n", " ('Monarchy',),\n", " ('Monarchy (Sultanate)',),\n", " ('Nonmetropolitan Territory of New Zealand',),\n", " ('Part of Denmark',),\n", " ('Overseas Department of France',),\n", " ('Special Administrative Region of China',),\n", " ('Dependent Territory of Norway',),\n", " ('Islamic Republic',),\n", " ('Administrated by the UN',),\n", " ('Territory of Australia',),\n", " (\"People'sRepublic\",),\n", " ('Socialistic Republic',),\n", " ('Constitutional Monarchy (Emirate)',),\n", " ('Socialistic State',),\n", " ('Occupied by Marocco',),\n", " ('Territorial Collectivity of France',),\n", " ('Commonwealth of the US',),\n", " ('Nonmetropolitan Territory of France',),\n", " ('Parlementary Monarchy',),\n", " ('Federation',),\n", " ('Independent Church State',),\n", " ('Autonomous Area',),\n", " ('Co-administrated',),\n", " ('Dependent Territory of the US',)]"]}, "execution_count": 28, "metadata": {}, "output_type": "execute_result"}], "source": ["QueryCurs.execute('''SELECT DISTINCT GovernmentForm FROM Country''').fetchall()"]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('Federal Republic', 2009320500),\n", " ('Republic', 2003311200),\n", " (\"People'sRepublic\", 1277558000)]"]}, "execution_count": 29, "metadata": {}, "output_type": "execute_result"}], "source": ["QueryCurs.execute('''SELECT GovernmentForm, SUM(Population) as Pop_Totale_Gouv\n", " FROM Country\n", " GROUP BY GovernmentForm\n", " ORDER BY Pop_Totale_Gouv DESC\n", " LIMIT 3\n", " ''').fetchall()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Question 3 \n", "- Combien de pays ont Elisabeth II \u00e0 la t\u00eate de leur gouvernement ?\n", "\n", "- Quelle proporition des sujets de Sa Majest\u00e9 ne parlent pas anglais ?\n", " - 78 % ou 83% ?"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('Elisabeth II', 35)]"]}, "execution_count": 30, "metadata": {}, "output_type": "execute_result"}], "source": ["QueryCurs.execute('''SELECT HeadOfState, Count(*)\n", "FROM Country\n", "WHERE HeadOfState = \"Elisabeth II\" ''').fetchall()"]}, {"cell_type": "code", "execution_count": 30, "metadata": {"collapsed": true}, "outputs": [], "source": ["# la population totale \n", "population_queen_elisabeth = QueryCurs.execute('''SELECT HeadOfState, SUM(Population)\n", "FROM Country\n", "WHERE HeadOfState = \"Elisabeth II\"''').fetchall()"]}, {"cell_type": "code", "execution_count": 31, "metadata": {"collapsed": true}, "outputs": [], "source": ["# La part de la population parlant anglais\n", "Part_parlant_anglais= QueryCurs.execute('''SELECT Language, SUM(Percentage*0.01*Population)\n", "FROM \n", "Country\n", "LEFT JOIN \n", "CountryLanguage \n", "ON Country.Code = CountryLanguage.CountryCode\n", "WHERE HeadOfState = \"Elisabeth II\"\n", "AND Language = \"English\"\n", "''').fetchall()"]}, {"cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.7793618770018202"]}, "execution_count": 33, "metadata": {}, "output_type": "execute_result"}], "source": ["# La r\u00e9ponse est 78% d'apr\u00e8s ces donn\u00e9es\n", "Part_parlant_anglais[0][1]/population_queen_elisabeth[0][1]"]}, {"cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('English', 95762181.19999999, 114637500)]"]}, "execution_count": 34, "metadata": {}, "output_type": "execute_result"}], "source": ["## on trouve 83% si on ne fait pas attention au fait que dans certaines zones, 0% de la population parle anglais\n", "## La population totale n'est alors pas la bonne, comme dans cet exemple\n", "\n", "QueryCurs.execute('''SELECT Language,\n", "SUM(Population_pays*0.01*Percentage) as Part_parlant_anglais, SUM(Population_pays) as Population_totale\n", "FROM (SELECT Language, Code, Percentage, SUM(Population) as Population_pays\n", "FROM \n", " Country\n", "LEFT JOIN \n", " CountryLanguage \n", "ON Country.Code = CountryLanguage.CountryCode\n", "WHERE HeadOfState = \"Elisabeth II\" AND Language == \"English\"\n", "GROUP BY Code)''').fetchall()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Conclusion: il vaut mieux \u00e9crire deux requ\u00eates simples et lisibles pour obtenir le bon r\u00e9sultat, plut\u00f4t qu'une requ\u00eate qui fait tout en une seule passe mais dont on va devoir v\u00e9rifier la correction longuement..."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Question 4 - passons \u00e0 Pandas\n", "Cr\u00e9er une DataFrame qui contient les informations suivantes par pays :\n", "- le nom\n", "- le code du pays\n", "- le nombre de langues parl\u00e9es\n", "- le nombre de langues officielles\n", "- la population\n", "- le GNP\n", "- l'esp\u00e9rance de vie\n", "\n", "**Indice : utiliser la commande pd.read_sql_query**\n", "\n", "\n", "Que dit la matrice de corr\u00e9lation de ces variables ?"]}, {"cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CodeNamePopulationGNPLifeExpectancyNb_langues_parleesNb_langues_officielles
0ABWAruba103000828.078.441
1AFGAfghanistan227200005976.045.952
2AGOAngola128780006648.038.3131
3AIAAnguilla800063.276.111
4ALBAlbania34012003205.071.631
\n", "
"], "text/plain": [" Code Name Population GNP LifeExpectancy Nb_langues_parlees \\\n", "0 ABW Aruba 103000 828.0 78.4 4 \n", "1 AFG Afghanistan 22720000 5976.0 45.9 5 \n", "2 AGO Angola 12878000 6648.0 38.3 13 \n", "3 AIA Anguilla 8000 63.2 76.1 1 \n", "4 ALB Albania 3401200 3205.0 71.6 3 \n", "\n", " Nb_langues_officielles \n", "0 1 \n", "1 2 \n", "2 1 \n", "3 1 \n", "4 1 "]}, "execution_count": 35, "metadata": {}, "output_type": "execute_result"}], "source": ["df = pd.read_sql_query('''SELECT Code, Name, Population, GNP , LifeExpectancy,\n", " COUNT(*) as Nb_langues_parlees, SUM(IsOfficial) as Nb_langues_officielles\n", " FROM Country\n", " INNER JOIN CountryLanguage ON Country.Code = CountryLanguage.CountryCode\n", " GROUP BY Country.Code''',\n", " CreateDataBase)\n", "df.head()"]}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationGNPLifeExpectancyNb_langues_parleesNb_langues_officielles
Population1.0000000.2803760.0067740.398681-0.010049
GNP0.2803761.0000000.1640940.2601810.019451
LifeExpectancy0.0067740.1640941.000000-0.3179620.270675
Nb_langues_parlees0.3986810.260181-0.3179621.000000-0.084946
Nb_langues_officielles-0.0100490.0194510.270675-0.0849461.000000
\n", "
"], "text/plain": [" Population GNP LifeExpectancy \\\n", "Population 1.000000 0.280376 0.006774 \n", "GNP 0.280376 1.000000 0.164094 \n", "LifeExpectancy 0.006774 0.164094 1.000000 \n", "Nb_langues_parlees 0.398681 0.260181 -0.317962 \n", "Nb_langues_officielles -0.010049 0.019451 0.270675 \n", "\n", " Nb_langues_parlees Nb_langues_officielles \n", "Population 0.398681 -0.010049 \n", "GNP 0.260181 0.019451 \n", "LifeExpectancy -0.317962 0.270675 \n", "Nb_langues_parlees 1.000000 -0.084946 \n", "Nb_langues_officielles -0.084946 1.000000 "]}, "execution_count": 36, "metadata": {}, "output_type": "execute_result"}], "source": ["df.corr()"]}, {"cell_type": "code", "execution_count": 36, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"anaconda-cloud": {}, "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}