.. _citybikeviewsrst: =============== City Bike Views =============== .. only:: html **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/challenges/city_bike/city_bike_views.ipynb|*` Based on the data available at `Divvy Data `__, some ways to look at the data. .. code:: ipython3 from jyquickhelper import add_notebook_menu add_notebook_menu() .. contents:: :local: .. code:: ipython3 %matplotlib inline The data -------- `Divvy Data `__ publishes a sample of the data. .. code:: ipython3 from pyensae.datasource import download_data file = download_data("Divvy_Trips_2016_Q3Q4.zip", url="https://s3.amazonaws.com/divvy-data/tripdata/") .. code:: ipython3 import pandas stations = pandas.read_csv("Divvy_Stations_2016_Q3.csv") bikes = pandas.concat([pandas.read_csv("Divvy_Trips_2016_Q3.csv"), pandas.read_csv("Divvy_Trips_2016_Q4.csv")]) .. code:: ipython3 bikes.head() .. raw:: html

	trip_id	starttime	stoptime	bikeid	tripduration	from_station_id	from_station_name	to_station_id	to_station_name	usertype	gender	birthyear
0	12150160	9/30/2016 23:59:58	10/1/2016 00:04:03	4959	245	69	Damen Ave & Pierce Ave	17	Wood St & Division St	Subscriber	Male	1988.0
1	12150159	9/30/2016 23:59:58	10/1/2016 00:04:09	2589	251	383	Ashland Ave & Harrison St	320	Loomis St & Lexington St	Subscriber	Female	1990.0
2	12150158	9/30/2016 23:59:51	10/1/2016 00:24:51	3656	1500	302	Sheffield Ave & Wrightwood Ave	334	Lake Shore Dr & Belmont Ave	Customer	NaN	NaN
3	12150157	9/30/2016 23:59:51	10/1/2016 00:03:56	3570	245	475	Washtenaw Ave & Lawrence Ave	471	Francisco Ave & Foster Ave	Subscriber	Female	1988.0
4	12150156	9/30/2016 23:59:32	10/1/2016 00:26:50	3158	1638	302	Sheffield Ave & Wrightwood Ave	492	Leavitt St & Addison St	Customer	NaN	NaN

About age --------- .. code:: ipython3 from datetime import datetime, time df = bikes df["dtstart"] = pandas.to_datetime(df.starttime, infer_datetime_format=True) df["dtstop"] = pandas.to_datetime(df.stoptime, infer_datetime_format=True) df["stopday"] = df.dtstop.apply(lambda r: datetime(r.year, r.month, r.day)) df["stoptime"] = df.dtstop.apply(lambda r: time(r.hour, r.minute, 0)) df["stoptime10"] = df.dtstop.apply(lambda r: time(r.hour, (r.minute // 10)*10, 0)) # every 10 minutes df['stopweekday'] = df['dtstop'].dt.dayofweek .. code:: ipython3 df['duration'] = df["dtstop"] - df["dtstart"] df["age"] = - df["birthyear"] + 2016 df['duration_sec'] = df['duration'].apply(lambda x: x.total_seconds()) .. code:: ipython3 df["stoptime_sec"] = df.dtstop.apply(lambda r: r.hour * 60 + r.minute) .. code:: ipython3 df.describe().T .. raw:: html

	count	mean	std	min	25%	50%	75%	max
trip_id	2.12564e+06	1.16993e+07	731388	1.04267e+07	1.10663e+07	1.17015e+07	1.23313e+07	1.29792e+07
bikeid	2.12564e+06	3251.98	1730.44	1	1755	3446	4802	5920
tripduration	2.12564e+06	1008.55	1816.1	60	416	716	1195	86365
from_station_id	2.12564e+06	179.916	130.524	2	75	157	268	620
to_station_id	2.12564e+06	180.352	130.488	2	75	157	272	620
birthyear	1.59034e+06	1980.79	10.754	1899	1975	1984	1989	2000
stopweekday	2.12564e+06	2.95275	2.02016	0	1	3	5	6
duration	2125643	0 days 00:16:48.183487	0 days 00:30:15.597468	0 days 00:00:59	0 days 00:06:56	0 days 00:11:56	0 days 00:19:55	0 days 23:59:24
age	1.59034e+06	35.2129	10.754	16	27	32	41	117
duration_sec	2.12564e+06	1008.18	1815.6	59	416	716	1195	86364
stoptime_sec	2.12564e+06	869.324	284.732	0	652	919	1080	1439

.. code:: ipython3 df.shape .. parsed-literal:: (2125643, 21) We take a random sample. .. code:: ipython3 import random ens = pandas.Series([random.randint(0,99) for i in range(df.shape[0])]) sample = df[ens==0] .. parsed-literal:: c:\python370_x64\lib\site-packages\ipykernel_launcher.py:3: UserWarning: Boolean Series key will be reindexed to match DataFrame index. This is separate from the ipykernel package so we can avoid doing imports until .. code:: ipython3 sample.shape sample = sample[(sample.age < 100) & (sample.duration_sec < 3600)] .. code:: ipython3 import numpy as np import pandas as pd import seaborn as sns sns.set(style="white") g = sns.jointplot(sample.age, sample.duration_sec, kind="kde", size=7, space=0) .. parsed-literal:: c:\python370_x64\lib\site-packages\seaborn\axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code. warnings.warn(msg, UserWarning) c:\python370_x64\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result. return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval .. image:: city_bike_views_16_1.png .. code:: ipython3 import numpy as np import pandas as pd import seaborn as sns sns.set(style="white") g = sns.jointplot(sample.age, sample.stoptime_sec, kind="kde", size=7, space=0) .. parsed-literal:: c:\python370_x64\lib\site-packages\seaborn\axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code. warnings.warn(msg, UserWarning) c:\python370_x64\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result. return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval .. image:: city_bike_views_17_1.png The duration seems correlated to the age. Let’s see. Younger people during the weekend are more active and bike longer. .. code:: ipython3 import matplotlib.pyplot as plt fig, ax = plt.subplots(1,2, figsize=(14,4)) sns.boxplot(x="stopweekday", y="age", data=df[df.age < 100], color="c", ax=ax[0]) sns.boxplot(x="stopweekday", y="duration_sec", data=df[df.duration_sec < 3600], color="c", ax=ax[1]); .. image:: city_bike_views_19_0.png However, linear correlations are not so great. .. code:: ipython3 df.corr() .. raw:: html

	trip_id	bikeid	tripduration	from_station_id	to_station_id	birthyear	stopweekday	age	duration_sec	stoptime_sec
trip_id	1.000000	-0.025039	-0.071566	0.008129	0.005505	-0.035032	-0.067585	0.035032	-0.071514	-0.063709
bikeid	-0.025039	1.000000	0.001088	0.009959	0.009538	-0.010901	0.000779	0.010901	0.001090	-0.007245
tripduration	-0.071566	0.001088	1.000000	-0.008972	-0.004730	-0.009788	0.069885	0.009788	0.999969	0.034435
from_station_id	0.008129	0.009959	-0.008972	1.000000	0.386314	0.019982	0.019426	-0.019982	-0.008970	-0.016486
to_station_id	0.005505	0.009538	-0.004730	0.386314	1.000000	0.021198	0.011177	-0.021198	-0.004734	0.061833
birthyear	-0.035032	-0.010901	-0.009788	0.019982	0.021198	1.000000	0.057081	-1.000000	-0.009808	0.085929
stopweekday	-0.067585	0.000779	0.069885	0.019426	0.011177	0.057081	1.000000	-0.057081	0.069848	0.017866
age	0.035032	0.010901	0.009788	-0.019982	-0.021198	-1.000000	-0.057081	1.000000	0.009808	-0.085929
duration_sec	-0.071514	0.001090	0.999969	-0.008970	-0.004734	-0.009808	0.069848	0.009808	1.000000	0.034514
stoptime_sec	-0.063709	-0.007245	0.034435	-0.016486	0.061833	0.085929	0.017866	-0.085929	0.034514	1.000000

.. code:: ipython3 fig, ax = plt.subplots(1,2, figsize=(14,4)) sns.violinplot(x="stopweekday", y="duration_sec", hue="gender", data=sample, split=True, inner="quart", ax=ax[0]) sns.violinplot(x="stopweekday", y="age", hue="gender", data=sample, split=True, inner="quart", ax=ax[1]) sns.despine(left=True); .. image:: city_bike_views_22_0.png .. code:: ipython3 fig, ax = plt.subplots(1,2, figsize=(14,4)) sns.violinplot(x="stopweekday", y="stoptime_sec", hue="gender", data=sample, split=True, inner="quart", ax=ax[0]) sns.violinplot(x="usertype", y="stoptime_sec", hue="gender", data=sample, split=True, inner="quart", ax=ax[1]) sns.despine(left=True); .. image:: city_bike_views_23_0.png .. code:: ipython3 fig, ax = plt.subplots(1,2, figsize=(14,4)) sns.violinplot(x="usertype", y="duration_sec", hue="gender", data=sample, split=True, inner="quart", ax=ax[0]) sns.violinplot(x="usertype", y="age", hue="gender", data=sample, split=True, inner="quart", ax=ax[1]) sns.despine(left=True); .. image:: city_bike_views_24_0.png Non-linear correlations ----------------------- We apply the following `Corrélations non linéaires `__. .. code:: ipython3 sample2 = sample.copy() sample2["age_inv"] = sample2.age ** -1 sample2["gender_num"] = sample2.gender.apply(lambda x: (1 if x == "Male" else 0)) sample2["usertype_num"] = sample2.usertype.apply(lambda x: (1 if x == "Subscriber" else 0)) sample2.corr() .. raw:: html

	trip_id	bikeid	tripduration	from_station_id	to_station_id	birthyear	stopweekday	age	duration_sec	stoptime_sec	age_inv	gender_num	usertype_num
trip_id	1.000000	-0.020790	-0.090476	-0.000107	-0.013945	-0.032304	-0.044524	0.032304	-0.089994	-0.050106	-0.027614	0.058240	0.014886
bikeid	-0.020790	1.000000	0.011953	0.009183	0.013776	-0.011606	-0.001548	0.011606	0.011944	-0.004755	-0.009776	0.031241	-0.000271
tripduration	-0.090476	0.011953	1.000000	0.044958	0.050456	-0.005926	0.063156	0.005926	0.999999	0.073529	-0.016003	-0.108492	-0.008061
from_station_id	-0.000107	0.009183	0.044958	1.000000	0.381516	0.021112	0.051470	-0.021112	0.044957	-0.010079	0.029148	-0.027132	0.019338
to_station_id	-0.013945	0.013776	0.050456	0.381516	1.000000	0.027436	0.049334	-0.027436	0.050440	0.089843	0.033983	-0.031992	0.014013
birthyear	-0.032304	-0.011606	-0.005926	0.021112	0.027436	1.000000	0.047890	-1.000000	-0.005937	0.090793	0.951178	-0.080894	0.008748
stopweekday	-0.044524	-0.001548	0.063156	0.051470	0.049334	0.047890	1.000000	-0.047890	0.063146	-0.005645	0.052167	-0.043754	-0.009274
age	0.032304	0.011606	0.005926	-0.021112	-0.027436	-1.000000	-0.047890	1.000000	0.005937	-0.090793	-0.951178	0.080894	-0.008748
duration_sec	-0.089994	0.011944	0.999999	0.044957	0.050440	-0.005937	0.063146	0.005937	1.000000	0.073511	-0.016010	-0.108476	-0.008063
stoptime_sec	-0.050106	-0.004755	0.073529	-0.010079	0.089843	0.090793	-0.005645	-0.090793	0.073511	1.000000	0.091370	-0.003124	-0.005995
age_inv	-0.027614	-0.009776	-0.016003	0.029148	0.033983	0.951178	0.052167	-0.951178	-0.016010	0.091370	1.000000	-0.088716	0.009361
gender_num	0.058240	0.031241	-0.108492	-0.027132	-0.031992	-0.080894	-0.043754	0.080894	-0.108476	-0.003124	-0.088716	1.000000	-0.009875
usertype_num	0.014886	-0.000271	-0.008061	0.019338	0.014013	0.008748	-0.009274	-0.008748	-0.008063	-0.005995	0.009361	-0.009875	1.000000

.. code:: ipython3 from sklearn.model_selection import train_test_split from sklearn.preprocessing import scale import numpy def correlation_cross_val(df, model, draws=5, **params): cor = df.corr() df = scale(df[cor.columns]) for i in range(cor.shape[0]): xi = df[:, i:i+1] for j in range(cor.shape[1]): xj = df[:, j] mem = [] for k in range(0, draws): xi_train, xi_test, xj_train, xj_test = train_test_split(xi, xj, train_size=0.5) mod = model(**params) mod.fit(xi_train, xj_train) v = mod.predict(xi_test) c = (1 - numpy.var(v - xj_test)) mem.append(max(c, 0) **0.5) cor.iloc[i,j] = sum(mem) / len(mem) return cor from sklearn.tree import DecisionTreeRegressor cor = correlation_cross_val(sample2, DecisionTreeRegressor, draws=20) cor .. parsed-literal:: c:\python370_x64\lib\site-packages\sklearn\model_selection\_split.py:2026: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified. FutureWarning) .. raw:: html

	trip_id	bikeid	tripduration	from_station_id	to_station_id	birthyear	stopweekday	age	duration_sec	stoptime_sec	age_inv	gender_num	usertype_num
trip_id	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.990057	0.000000	0.000000	0.878676	0.000000	0.000000	0.015071
bikeid	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.017745
tripduration	0.000000	0.000000	0.999990	0.000000	0.000000	0.000000	0.000000	0.000000	0.999990	0.000000	0.000000	0.000000	0.099688
from_station_id	0.000000	0.000000	0.054325	0.999999	0.536015	0.008031	0.000000	0.006133	0.044061	0.012246	0.028895	0.000000	0.172352
to_station_id	0.000000	0.000000	0.109784	0.532589	0.999999	0.026304	0.000000	0.024202	0.105040	0.248720	0.019269	0.000000	0.153788
birthyear	0.011917	0.005125	0.058987	0.026033	0.034188	0.999965	0.006681	0.999942	0.053868	0.055993	0.999996	0.068499	0.209815
stopweekday	0.042561	0.020920	0.079774	0.065344	0.045880	0.085778	1.000000	0.055188	0.074982	0.033219	0.095588	0.054960	0.291366
age	0.011934	0.006577	0.051631	0.019733	0.028614	0.999909	0.017197	0.999943	0.063061	0.052308	0.999997	0.055599	0.332540
duration_sec	0.000000	0.000000	0.999991	0.000000	0.000000	0.000000	0.000000	0.000000	0.999992	0.000000	0.000000	0.000000	0.026094
stoptime_sec	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.114911
age_inv	0.017233	0.009412	0.042053	0.028141	0.010642	0.999918	0.024621	0.999912	0.022328	0.051536	0.999997	0.070596	0.275990
gender_num	0.049358	0.032177	0.073034	0.038476	0.065741	0.080659	0.039153	0.062295	0.079421	0.052172	0.085579	1.000000	0.245688
usertype_num	0.032215	0.041777	0.063852	0.039527	0.032406	0.038315	0.037363	0.055648	0.059596	0.037164	0.039390	0.035496	1.000000

*from_station_id* and *start_station_id* seem related. Which means there is frequent trip. Funny, the trip id can explain the stopping time… It should be removed from any dataset. .. code:: ipython3 sample2.plot(x="trip_id", y="stoptime_sec", kind="scatter", figsize=(14,4)); .. image:: city_bike_views_29_0.png