SQL Magic Commands with SQLite in a Notebook¶

Links: notebook, html, PDF, python, slides, GitHub

SQL from a notebooks, using magic commands to query a sqllite3 database.

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

This notebook introduced some of the magic commands used to easily query a SQLite database with sqlite3. SQLite databases can be easily manipulated with open source tools such as SQLite Manager (add-on for firefox) or SQLiteSpy (only on Windows). However, it is very convenient to put the results of a SQL query into a DataFrame. That what’s this notebook is about.

Let’s start by importing some data.

import pyensae
import pyensae.datasource
%load_ext pyensae
pyensae.datasource.download_data("velib_vanves.zip", website = "xd")

['.\velib_vanves.txt']

import os
if os.path.exists("notebook_example.db3"):
    os.remove("notebook_example.db3")

We connect to the database:

%SQL_connect notebook_example.db3

<pyensae.sql.sql_interface_database.InterfaceSQLDatabase at 0x225e8b6eef0>

The database is empty so the table list should be empty:

%SQL_tables

[]

So we import a flat file (TSV format only) into the database.

%SQL_import_tsv -t velib velib_vanves.txt

We check there is one more table:

%SQL_tables

['velib']

We get the schema of the database:

%SQL_schema velib

{0: ('address', str),
('available_bike_stands', int),
('available_bikes', int),
('banking', int),
('bike_stands', int),
('bonus', int),
('contract_name', str),
('last_update', str),
('lat', str),
('lng', str),
('name', str),
('number', int),
('status', str),
('idr', int)}

And we execute the first query:

%%SQL
SELECT MAX(available_bike_stands +available_bikes) FROM velib

	MAX(available_bike_stands +available_bikes)
0	71

Or another in one row:

%SQL -q "SELECT COUNT(*) FROM velib"

	COUNT(*)
0	9461

We want to draw a random sample out of this table. We ask the database to recognize the following python function (its name cannot contain '_'):

import random
def arandomfunction():
    return random.randint(1,100)
%SQL_add_function arandomfunction

We then execute the following query. Because the first line contains a identifier, the query will only display the shape of the results dataframe and not its content which will be stored in df:

%%SQL --df=df
SELECT * FROM (
    SELECT *,arandomfunction() AS rnd FROM velib)
    WHERE rnd==1

	address	available_bike_stands	available_bikes	bike_stands	contract_name	last_update	lat	lng	name	number	status	idr	rnd
0	112 RUE VERCINGETORIX - 75014 PARIS	66	1	67	Paris	15/07/2013 16:40	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	25249	58
1	112 RUE VERCINGETORIX - 75014 PARIS	53	14	67	Paris	15/07/2013 22:20	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	108821	76
2	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	16/07/2013 13:50	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	337415	17
3	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	16/07/2013 14:15	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	343560	60
4	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	17/07/2013 04:05	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	547574	20
5	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	17/07/2013 14:10	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	696283	10
6	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	18/07/2013 02:05	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	872030	89
7	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	18/07/2013 03:30	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	892923	8
8	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	18/07/2013 03:40	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	895381	42
9	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	18/07/2013 07:45	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	955602	15

We look at the head:

df.head()

	address	available_bike_stands	available_bikes	bike_stands	contract_name	last_update	lat	lng	name	number	status	idr	rnd
0	112 RUE VERCINGETORIX - 75014 PARIS	66	1	67	Paris	15/07/2013 16:40	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	25249	58
1	112 RUE VERCINGETORIX - 75014 PARIS	53	14	67	Paris	15/07/2013 22:20	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	108821	76
2	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	16/07/2013 13:50	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	337415	17
3	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	16/07/2013 14:15	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	343560	60
4	112 RUE VERCINGETORIX - 75014 PARIS	61	6	67	Paris	17/07/2013 04:05	48,83425925	2,313391647	14029 - GERGOVIE VERCINGETORIX	14029	OPEN	547574	20

If you into column rnd, you can see that it does not contain the value if was filtered on. It is probably because the python function is evaluated twice. We add this new dataframe to the database:

%SQL_import_df -t sample df

We check it is part the list of tables:

%SQL_tables

['sample', 'velib']

We do some more queries:

%%SQL

SELECT insample, COUNT(*) AS nb FROM (
    SELECT velib.*, sample.idr AS insample
    FROM velib OUTER JOIN sample
    ON velib.idr == sample.idr
)

unable to execute a SQL request (1)(file notebook_example.db3)
SELECT insample, COUNT(*) AS nb FROM (
    SELECT velib.*, sample.idr AS insample
    FROM velib OUTER JOIN sample
    ON velib.idr == sample.idr
)
RIGHT and FULL OUTER JOINs are not currently supported
'SELECT insample, COUNT(*) AS nb FROM (
    SELECT velib.*, sample.idr AS insample
    FROM velib OUTER JOIN sample
    ON velib.idr == sample.idr
)'

We fix it:

%%SQL

SELECT insample, COUNT(*) AS nb FROM (
    SELECT velib.idr, sample.idr is not null insample
    FROM velib LEFT OUTER JOIN sample
    ON velib.idr == sample.idr
) GROUP BY insample

	insample	nb
0	0	9375
1	1	86

Autocompletion also works when the prefix is DB.CC.:

%%SQL --df=dfo
SELECT DB.CC.velib.available_bike_stands, COUNT(*)
FROM DB.CC.velib
GROUP BY DB.CC.velib.available_bike_stands

	available_bike_stands	COUNT(*)
0	4	60
1	5	7
2	6	7
3	7	1
4	8	1
5	9	4
6	10	37
7	11	62
8	12	53
9	13	27

dfo.shape

(64, 2)

The autocompletion looks like:

from pyquickhelper.helpgen import NbImage
NbImage("dbcc.png")

We finally draw a graph:

dfo.plot(x="available_bike_stands", y="COUNT(*)", kind="bar", figsize=(16,4))

<matplotlib.axes._subplots.AxesSubplot at 0x225e8fd37f0>

And we finally close the connection:

%SQL_close

END

SQL Magic Commands with SQLite in a Notebook¶

Links

Contents

Information

Related Topics

This Page