Bike Pattern 2


We apply a bit of machine learning to Divvy data to look for a better division of Chicago: we try to identify usage patterns among bike stations.

from jyquickhelper import add_notebook_menu
add_notebook_menu()
%matplotlib inline

The data

Divvy Data publishes a sample of the data.

from pyensae.datasource import download_data
file = download_data("Divvy_Trips_2016_Q3Q4.zip", url="https://s3.amazonaws.com/divvy-data/tripdata/")

We load the stations and the trips.

import pandas
stations = pandas.read_csv("Divvy_Stations_2016_Q3.csv")
bikes = pandas.concat([pandas.read_csv("Divvy_Trips_2016_Q3.csv"),
                       pandas.read_csv("Divvy_Trips_2016_Q4.csv")])
bikes.head()
trip_id starttime stoptime bikeid tripduration from_station_id from_station_name to_station_id to_station_name usertype gender birthyear
0 12150160 9/30/2016 23:59:58 10/1/2016 00:04:03 4959 245 69 Damen Ave & Pierce Ave 17 Wood St & Division St Subscriber Male 1988.0
1 12150159 9/30/2016 23:59:58 10/1/2016 00:04:09 2589 251 383 Ashland Ave & Harrison St 320 Loomis St & Lexington St Subscriber Female 1990.0
2 12150158 9/30/2016 23:59:51 10/1/2016 00:24:51 3656 1500 302 Sheffield Ave & Wrightwood Ave 334 Lake Shore Dr & Belmont Ave Customer NaN NaN
3 12150157 9/30/2016 23:59:51 10/1/2016 00:03:56 3570 245 475 Washtenaw Ave & Lawrence Ave 471 Francisco Ave & Foster Ave Subscriber Female 1988.0
4 12150156 9/30/2016 23:59:32 10/1/2016 00:26:50 3158 1638 302 Sheffield Ave & Wrightwood Ave 492 Leavitt St & Addison St Customer NaN NaN
from datetime import datetime, time
df = bikes
df["dtstart"] = pandas.to_datetime(df.starttime, infer_datetime_format=True)
df["dtstop"] = pandas.to_datetime(df.stoptime, infer_datetime_format=True)

df["stopday"] = df.dtstop.apply(lambda r: datetime(r.year, r.month, r.day))
df["stoptime"] = df.dtstop.apply(lambda r: time(r.hour, r.minute, 0))
df["stoptime10"] = df.dtstop.apply(lambda r: time(r.hour, (r.minute // 10)*10, 0))  # every 10 minutes

df["startday"] = df.dtstart.apply(lambda r: datetime(r.year, r.month, r.day))
df["starttime"] = df.dtstart.apply(lambda r: time(r.hour, r.minute, 0))
df["starttime10"] = df.dtstart.apply(lambda r: time(r.hour, (r.minute // 10)*10, 0))  # every 10 minutes
df['stopweekday'] = df['dtstop'].dt.dayofweek
df['startweekday'] = df['dtstart'].dt.dayofweek
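The per-row ``apply`` calls above work but are slow on a few hundred thousand trips; pandas' vectorized datetime accessors give the same 10-minute binning. A sketch on a toy frame (not the author's code):

```python
import pandas as pd

# toy frame standing in for the Divvy trips
df = pd.DataFrame({"dtstart": pd.to_datetime([
    "2016-09-30 23:59:58", "2016-10-01 08:14:03"])})

# floor to the 10-minute slot, then keep only the time of day
df["starttime10"] = df["dtstart"].dt.floor("10min").dt.time
# midnight of the same day, and the day of the week (Monday=0)
df["startday"] = df["dtstart"].dt.normalize()
df["startweekday"] = df["dtstart"].dt.dayofweek

print(df["starttime10"].tolist())
```

``dt.floor("10min")`` rounds each timestamp down to its slot in one vectorized pass instead of one Python call per row.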

Normalize, aggregate and merge per start and stop time

key = ["to_station_id", "to_station_name", "stopweekday", "stoptime10"]
keep = key + ["trip_id"]

stopaggtime = df[keep].groupby(key, as_index=False).count()
stopaggtime.columns = key + ["nb_trips"]

stopaggday = df[keep[:-2] + ["trip_id"]].groupby(key[:-1], as_index=False).count()
stopaggday.columns = key[:-1] + ["nb_trips"]

stopmerge = stopaggtime.merge(stopaggday, on=key[:-1], suffixes=("", "day"))
stopmerge["stopdist"] = stopmerge["nb_trips"] / stopmerge["nb_tripsday"]
stopmerge.head()
to_station_id to_station_name stopweekday stoptime10 nb_trips nb_tripsday stopdist
0 2 Michigan Ave & Balbo Ave 0 00:10:00 2 913 0.002191
1 2 Michigan Ave & Balbo Ave 0 00:20:00 2 913 0.002191
2 2 Michigan Ave & Balbo Ave 0 00:30:00 2 913 0.002191
3 2 Michigan Ave & Balbo Ave 0 01:00:00 3 913 0.003286
4 2 Michigan Ave & Balbo Ave 0 01:10:00 2 913 0.002191
stopmerge[stopmerge["to_station_id"] == 2] \
       .plot(x="stoptime10", y="stopdist", figsize=(14,4), kind="area");
(figure: stop-time distribution for station 2 across 10-minute slots)
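The count-then-merge pattern used above can also be written with ``groupby().transform``, which computes the per-station totals without the intermediate merge. A sketch on toy data (column names are illustrative):

```python
import pandas as pd

trips = pd.DataFrame({
    "station": [2, 2, 2, 3],
    "slot": ["00:10", "00:10", "00:20", "00:10"],
    "trip_id": [10, 11, 12, 13],
})

# trips per (station, slot)
agg = trips.groupby(["station", "slot"], as_index=False)["trip_id"].count()
agg = agg.rename(columns={"trip_id": "nb_trips"})

# share of the station's total traffic that falls in each slot
agg["dist"] = agg["nb_trips"] / agg.groupby("station")["nb_trips"].transform("sum")
print(agg)
```

``transform("sum")`` broadcasts each station's total back onto its rows, so the division happens in place of the merge.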
key = ["from_station_id", "from_station_name", "startweekday", "starttime10"]
keep = key + ["trip_id"]

startaggtime = df[keep].groupby(key, as_index=False).count()
startaggtime.columns = key + ["nb_trips"]

startaggday = df[keep[:-2] + ["trip_id"]].groupby(key[:-1], as_index=False).count()
startaggday.columns = key[:-1] + ["nb_trips"]

startmerge = startaggtime.merge(startaggday, on=key[:-1], suffixes=("", "day"))
startmerge["startdist"] = startmerge["nb_trips"] / startmerge["nb_tripsday"]
startmerge.head()
from_station_id from_station_name startweekday starttime10 nb_trips nb_tripsday startdist
0 2 Michigan Ave & Balbo Ave 0 00:00:00 4 1065 0.003756
1 2 Michigan Ave & Balbo Ave 0 00:10:00 1 1065 0.000939
2 2 Michigan Ave & Balbo Ave 0 00:20:00 3 1065 0.002817
3 2 Michigan Ave & Balbo Ave 0 00:50:00 4 1065 0.003756
4 2 Michigan Ave & Balbo Ave 0 01:10:00 3 1065 0.002817
startmerge[startmerge["from_station_id"] == 2] \
         .plot(x="starttime10",  y="startdist", figsize=(14,4), kind="area");
(figure: start-time distribution for station 2 across 10-minute slots)
everything = stopmerge.merge(startmerge,
                             left_on=["to_station_id", "to_station_name", "stopweekday", "stoptime10"],
                             right_on=["from_station_id", "from_station_name","startweekday", "starttime10"],
                             suffixes=("stop", "start"),
                             how="outer")
everything.head()
to_station_id to_station_name stopweekday stoptime10 nb_tripsstop nb_tripsdaystop stopdist from_station_id from_station_name startweekday starttime10 nb_tripsstart nb_tripsdaystart startdist
0 2.0 Michigan Ave & Balbo Ave 0.0 00:10:00 2.0 913.0 0.002191 2.0 Michigan Ave & Balbo Ave 0.0 00:10:00 1.0 1065.0 0.000939
1 2.0 Michigan Ave & Balbo Ave 0.0 00:20:00 2.0 913.0 0.002191 2.0 Michigan Ave & Balbo Ave 0.0 00:20:00 3.0 1065.0 0.002817
2 2.0 Michigan Ave & Balbo Ave 0.0 00:30:00 2.0 913.0 0.002191 NaN NaN NaN NaN NaN NaN NaN
3 2.0 Michigan Ave & Balbo Ave 0.0 01:00:00 3.0 913.0 0.003286 NaN NaN NaN NaN NaN NaN NaN
4 2.0 Michigan Ave & Balbo Ave 0.0 01:10:00 2.0 913.0 0.002191 2.0 Michigan Ave & Balbo Ave 0.0 01:10:00 3.0 1065.0 0.002817
import numpy
from datetime import datetime

def bestof(x, y):
    # return x when it is a valid value (date, time, string),
    # otherwise fall back on y (handles NaN produced by the outer join)
    if isinstance(x, (datetime, time, str)):
        return x
    try:
        if x is None or isinstance(y, (datetime, time, str)) or numpy.isnan(x):
            return y
        else:
            return x
    except Exception:
        print(type(x), type(y))
        print(x, y)
        raise

bestof(datetime(2017,2,2), numpy.nan), bestof(numpy.nan, datetime(2017,2,2))
(datetime.datetime(2017, 2, 2, 0, 0), datetime.datetime(2017, 2, 2, 0, 0))
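Applying ``bestof`` row by row is slow; pandas' ``combine_first`` implements the same coalescing (take the left value, fall back on the right one when it is missing) column by column. A sketch:

```python
import numpy as np
import pandas as pd

left = pd.Series([2.0, np.nan, 5.0])
right = pd.Series([2.0, 7.0, np.nan])

# take the left value, falling back on the right one when NaN
station_id = left.combine_first(right)
print(station_id.tolist())
```

On the merged frame this would replace each ``every.apply(..., axis=1)`` call with one vectorized expression such as ``every["to_station_id"].combine_first(every["from_station_id"])``.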
every = everything.copy()
every["station_name"] = every.apply(lambda row: bestof(row["to_station_name"], row["from_station_name"]), axis=1)
every["station_id"] = every.apply(lambda row: bestof(row["to_station_id"], row["from_station_id"]), axis=1)
every["time10"] = every.apply(lambda row: bestof(row["stoptime10"], row["starttime10"]), axis=1)
every["weekday"] = every.apply(lambda row: bestof(row["stopweekday"], row["startweekday"]), axis=1)
every = every.drop(["stoptime10", "starttime10", "stopweekday", "startweekday",
                    "to_station_id", "from_station_id",
                    "to_station_name", "from_station_name"], axis=1)
every.head()
nb_tripsstop nb_tripsdaystop stopdist nb_tripsstart nb_tripsdaystart startdist station_name station_id time10 weekday
0 2.0 913.0 0.002191 1.0 1065.0 0.000939 Michigan Ave & Balbo Ave 2.0 00:10:00 0.0
1 2.0 913.0 0.002191 3.0 1065.0 0.002817 Michigan Ave & Balbo Ave 2.0 00:20:00 0.0
2 2.0 913.0 0.002191 NaN NaN NaN Michigan Ave & Balbo Ave 2.0 00:30:00 0.0
3 3.0 913.0 0.003286 NaN NaN NaN Michigan Ave & Balbo Ave 2.0 01:00:00 0.0
4 2.0 913.0 0.002191 3.0 1065.0 0.002817 Michigan Ave & Balbo Ave 2.0 01:10:00 0.0
every.shape, stopmerge.shape, startmerge.shape
((357809, 10), (298013, 7), (299700, 7))

We need vectors of equal size, which means filling NaN values with 0 and adding the time slots that are missing.

every.columns
Index(['nb_tripsstop', 'nb_tripsdaystop', 'stopdist', 'nb_tripsstart',
       'nb_tripsdaystart', 'startdist', 'station_name', 'station_id', 'time10',
       'weekday'],
      dtype='object')
keys = ['station_name', 'station_id', 'weekday', 'time10']
for c in every.columns:
    if c not in keys:
        every[c] = every[c].fillna(0)  # fillna returns a copy, assign it back
from ensae_projects.datainc.data_bikes import add_missing_time
full = df = add_missing_time(every, delay=10,  column="time10",
                             values=[c for c in every.columns if c not in keys])
full = full[['station_name', 'station_id', 'time10', 'weekday',
             'stopdist', 'startdist',
             'nb_tripsstop', 'nb_tripsdaystop', 'nb_tripsstart',
       'nb_tripsdaystart']].sort_values(keys)
full.head()
station_name station_id time10 weekday stopdist startdist nb_tripsstop nb_tripsdaystop nb_tripsstart nb_tripsdaystart
357809 2112 W Peterson Ave 456.0 00:00:00 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
357810 2112 W Peterson Ave 456.0 00:10:00 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
357811 2112 W Peterson Ave 456.0 00:20:00 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
341443 2112 W Peterson Ave 456.0 00:30:00 0.0 0.0 0.021277 0.0 0.0 1.0 47.0
357812 2112 W Peterson Ave 456.0 00:40:00 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
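``add_missing_time`` comes from the *ensae_projects* helpers; with plain pandas, the missing 10-minute slots can be added by reindexing on a complete (station, slot) grid. A sketch on toy data with illustrative column names:

```python
import pandas as pd

every = pd.DataFrame({
    "station_id": [2, 2],
    "time10": ["00:00", "00:20"],
    "nb_trips": [4, 3],
})

# full grid of slots (144 slots per day in the real data)
slots = ["00:00", "00:10", "00:20"]
grid = pd.MultiIndex.from_product([every["station_id"].unique(), slots],
                                  names=["station_id", "time10"])

# reindex inserts the absent slots with 0 trips
full = (every.set_index(["station_id", "time10"])
             .reindex(grid, fill_value=0)
             .reset_index())
print(full["nb_trips"].tolist())
```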

Clustering (stop and start)

We cluster these distributions to look for patterns. Each feature vector must have the same length: 24 hours × 6 ten-minute slots = 144 values per day.

With the missing time slots filled in, the plots look much better.

df = full
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2, 1, figsize=(14,6))
df[df["station_id"] == 2].plot(x="time10", y="startdist", figsize=(14,4), kind="area", ax=ax[0])
df[df["station_id"] == 2].plot(x="time10", y="stopdist", figsize=(14,4),
                               kind="area", ax=ax[1], color="r");
(figure: completed start and stop distributions for station 2, two stacked area plots)

Let’s build the features.

features = df.pivot_table(index=["station_id", "station_name", "weekday"],
                          columns="time10", values=["startdist", "stopdist"]).reset_index()

features.head()
station_id station_name weekday startdist ... stopdist
time10 00:00:00 00:10:00 00:20:00 00:30:00 00:40:00 00:50:00 01:00:00 ... 22:20:00 22:30:00 22:40:00 22:50:00 23:00:00 23:10:00 23:20:00 23:30:00 23:40:00 23:50:00
0 2.0 Michigan Ave & Balbo Ave 0.0 0.003756 0.000939 0.002817 0.000000 0.000000 0.003756 0.000000 ... 0.004381 0.002191 0.004381 0.002191 0.004381 0.004381 0.005476 0.002191 0.000000 0.005476
1 2.0 Michigan Ave & Balbo Ave 1.0 0.000000 0.000000 0.001106 0.001106 0.001106 0.002212 0.000000 ... 0.009371 0.012048 0.006693 0.004016 0.005355 0.006693 0.002677 0.000000 0.000000 0.000000
2 2.0 Michigan Ave & Balbo Ave 2.0 0.001357 0.002714 0.000000 0.001357 0.000000 0.005427 0.000000 ... 0.002907 0.002907 0.015988 0.005814 0.001453 0.001453 0.011628 0.000000 0.000000 0.007267
3 2.0 Michigan Ave & Balbo Ave 3.0 0.000000 0.004144 0.000000 0.000000 0.002762 0.004144 0.000000 ... 0.009274 0.003091 0.003091 0.007728 0.001546 0.003091 0.009274 0.001546 0.007728 0.001546
4 2.0 Michigan Ave & Balbo Ave 4.0 0.000000 0.000000 0.000000 0.002846 0.000000 0.000000 0.000949 ... 0.008214 0.001027 0.006160 0.004107 0.015400 0.006160 0.002053 0.006160 0.007187 0.000000

5 rows × 291 columns
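``pivot_table`` is what turns each (station, weekday) pair into a single row with one column per time slot: 2 value columns × 144 slots plus the 3 index columns gives the 291 columns above. A minimal sketch on toy data:

```python
import pandas as pd

full = pd.DataFrame({
    "station_id": [2, 2, 2, 2],
    "weekday": [0, 0, 1, 1],
    "time10": ["00:00", "00:10", "00:00", "00:10"],
    "startdist": [0.1, 0.2, 0.3, 0.4],
})

# one row per (station, weekday), one column per time slot
features = full.pivot_table(index=["station_id", "weekday"],
                            columns="time10", values="startdist").reset_index()
print(features.shape)
```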

names = features.columns[3:]
len(names)
288
from sklearn.cluster import KMeans
clus = KMeans(8)
clus.fit(features[names])
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
pred = clus.predict(features[names])
set(pred)
{0, 1, 2, 3, 4, 5, 6, 7}
features["cluster"] = pred
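Eight clusters is a choice, not a result; one way to sanity-check it is to look at how the k-means inertia drops as the number of clusters grows (the elbow heuristic). A sketch on synthetic blobs, not the Divvy features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# two well-separated blobs
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])

# inertia = within-cluster sum of squares, lower is tighter
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (1, 2, 3, 4)}
# inertia drops sharply from k=1 to k=2, then flattens: the elbow is at 2
print(inertias)
```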

Let’s see what it means across days: we check whether a cluster is associated with working days or the week-end.

features[["cluster", "weekday", "station_id"]].groupby(["cluster", "weekday"]).count()
station_id
time10
cluster weekday
0 3.0 1
4.0 1
6.0 1
1 0.0 146
1.0 110
2.0 106
3.0 119
4.0 143
5.0 553
6.0 547
2 0.0 137
1.0 141
2.0 150
3.0 147
4.0 149
5.0 8
6.0 12
3 0.0 1
3.0 1
4 2.0 1
3.0 1
6.0 1
5 6.0 1
6 0.0 1
1.0 1
2.0 1
6.0 1
7 0.0 291
1.0 326
2.0 322
3.0 308
4.0 287
5.0 19
6.0 17
nb = features[["cluster", "weekday", "station_id"]].groupby(["cluster", "weekday"]).count()
nb = nb.reset_index()
nb[nb.cluster.isin([0, 3, 5, 6])].pivot(index="weekday", columns="cluster", values="station_id").plot(kind="bar");
(figure: number of stations per weekday for clusters 0, 3, 5, 6, bar chart)

Let’s draw the clusters.

centers = clus.cluster_centers_.T
import matplotlib.pyplot as plt
fig, ax = plt.subplots(centers.shape[1], 2, figsize=(10,10))
nbf = centers.shape[0] // 2
x = list(range(0,nbf))
col = 0
dec = 0
colors = ["red", "yellow", "gray", "green", "brown", "orange", "blue"]
for i in range(centers.shape[1]):
    if 2*i == centers.shape[1]:
        col += 1
        dec += centers.shape[1]
    color = colors[i%len(colors)]
    ax[2*i-dec, col].bar (x, centers[:nbf,i], width=1.0, color=color)
    ax[2*i-dec, col].set_ylabel("cluster %d - start" % i, color=color)
    ax[2*i+1-dec, col].bar (x, centers[nbf:,i], width=1.0, color=color)
    ax[2*i+1-dec, col].set_ylabel("cluster %d - stop" % i, color=color)
(figure: start and stop profiles of the 8 cluster centers, one pair of bar plots per cluster)

Four patterns emerge. The small clusters are annoying, but let’s show them on a map anyway. The largest one corresponds to the week-end.

Graph

We first need to get 7 clusters for each station, one per day of the week.

piv = features.pivot_table(index=["station_id","station_name"],
                           columns="weekday", values="cluster")
piv.head()
time10
weekday 0.0 1.0 2.0 3.0 4.0 5.0 6.0
station_id station_name
2.0 Michigan Ave & Balbo Ave 1.0 7.0 1.0 1.0 1.0 1.0 1.0
3.0 Shedd Aquarium 1.0 1.0 1.0 1.0 1.0 1.0 1.0
4.0 Burnham Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0
5.0 State St & Harrison St 7.0 7.0 7.0 7.0 7.0 1.0 1.0
6.0 Dusable Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0
piv["distincts"] = piv.apply(lambda row: len(set(row[i] for i in range(0,7))), axis=1)
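In the lambda above, every NaN counts as a distinct value because NaN != NaN, so a station with several empty days gets a high ``distincts`` count. pandas' ``nunique`` ignores NaN instead; a small sketch of the difference:

```python
import pandas as pd

# a week of cluster labels for one station, with two missing days
row = [1.0, 1.0, float("nan"), float("nan"), 7.0]

# set-based count: each NaN object is "unique" because NaN != NaN
set_count = len(set(row))
# pandas drops NaN by default
nunique_count = pd.Series(row).nunique()
print(set_count, nunique_count)
```

Which behaviour is wanted depends on the question: counting NaN as "yet another regime" flags the barely used stations, while ``nunique`` measures only the observed clusters.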

Let’s see which stations are classified into 4 clusters or more. NaN means no bike stopped at the station that day; these are mostly unused stations.

piv[piv.distincts >= 4]
time10 distincts
weekday 0.0 1.0 2.0 3.0 4.0 5.0 6.0
station_id station_name
391.0 Halsted St & 69th St 3.0 7.0 1.0 7.0 1.0 1.0 2.0 4
440.0 Lawndale Ave & 23rd St 7.0 7.0 1.0 7.0 2.0 1.0 0.0 4
557.0 Damen Ave & Garfield Blvd NaN 2.0 1.0 NaN 2.0 NaN NaN 6
558.0 Ashland Ave & Garfield Blvd NaN 1.0 1.0 1.0 1.0 2.0 NaN 4
561.0 Damen Ave & 61st St 2.0 7.0 2.0 7.0 NaN 1.0 1.0 4
562.0 Racine Ave & 61st St NaN NaN NaN NaN 7.0 NaN NaN 7
564.0 Racine Ave & 65th St 1.0 1.0 NaN NaN 7.0 1.0 7.0 4
565.0 Ashland Ave & 66th St 1.0 1.0 NaN 1.0 0.0 NaN 5.0 5
567.0 May St & 69th St 6.0 6.0 2.0 2.0 1.0 1.0 NaN 4
568.0 Normal Ave & 72nd St 1.0 1.0 7.0 NaN 7.0 1.0 4.0 4
569.0 Woodlawn Ave & 75th St 1.0 NaN 7.0 1.0 1.0 NaN 1.0 4
576.0 Greenwood Ave & 79th St 7.0 1.0 1.0 NaN 2.0 1.0 1.0 4
581.0 Commercial Ave & 83rd St 1.0 NaN 7.0 NaN NaN 1.0 1.0 5
582.0 Phillips Ave & 82nd St NaN NaN 1.0 1.0 NaN 1.0 7.0 5
586.0 MLK Jr Dr & 83rd St 1.0 2.0 6.0 7.0 7.0 7.0 2.0 4
587.0 Wabash Ave & 83rd St NaN NaN 1.0 NaN 1.0 2.0 7.0 6
588.0 South Chicago Ave & 83rd St NaN 2.0 7.0 3.0 1.0 2.0 7.0 5
593.0 Halsted St & 59th St NaN 7.0 4.0 4.0 NaN 1.0 1.0 5
pivn = piv.reset_index()
pivn.columns = [' '.join(str(_).replace(".0", "") for _ in col).strip() for col in pivn.columns.values]
pivn.head()
station_id station_name 0 1 2 3 4 5 6 distincts
0 2.0 Michigan Ave & Balbo Ave 1.0 7.0 1.0 1.0 1.0 1.0 1.0 2
1 3.0 Shedd Aquarium 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
2 4.0 Burnham Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
3 5.0 State St & Harrison St 7.0 7.0 7.0 7.0 7.0 1.0 1.0 2
4 6.0 Dusable Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1

Let’s draw a map for a week day.

data = stations.merge(pivn, left_on=["id", "name"],
                      right_on=["station_id", "station_name"], suffixes=('_s', '_c'))
data.sort_values("id").head()
id name latitude longitude dpcapacity online_date station_id station_name 0 1 2 3 4 5 6 distincts
357 2 Michigan Ave & Balbo Ave 41.872638 -87.623979 35 5/8/2015 2.0 Michigan Ave & Balbo Ave 1.0 7.0 1.0 1.0 1.0 1.0 1.0 2
456 3 Shedd Aquarium 41.867226 -87.615355 31 4/24/2015 3.0 Shedd Aquarium 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
53 4 Burnham Harbor 41.856268 -87.613348 23 5/16/2015 4.0 Burnham Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
497 5 State St & Harrison St 41.874053 -87.627716 23 6/18/2013 5.0 State St & Harrison St 7.0 7.0 7.0 7.0 7.0 1.0 1.0 2
188 6 Dusable Harbor 41.885042 -87.612795 31 4/24/2015 6.0 Dusable Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
from ensae_projects.datainc.data_bikes import folium_html_stations_map

colors = ["red", "yellow", "gray", "green", "brown", "orange", "blue", "black"]
for i, c in enumerate(colors):
    print("Cluster {0} is {1}".format(i, c))
xy = []
for els in data.apply(lambda row: (row["latitude"], row["longitude"], row["1"], row["name"]), axis=1):
    try:
        cl = int(els[2])
    except:
        # NaN
        continue
    name = "%s c%d" % (els[3], cl)
    color = colors[cl]
    xy.append( ( (els[0], els[1]), (name, color)))
folium_html_stations_map(xy, width="80%")
Cluster 0 is red
Cluster 1 is yellow
Cluster 2 is gray
Cluster 3 is green
Cluster 4 is brown
Cluster 5 is orange
Cluster 6 is blue
Cluster 7 is black

Look at the colors close to the parks: people ride to the parks after work. Let’s now look at the week-end.

xy = []
for els in data.apply(lambda row: (row["latitude"], row["longitude"], row["5"], row["name"]), axis=1):
    try:
        cl = int(els[2])
    except:
        # NaN
        continue
    name = "%s c%d" % (els[3], cl)
    color = colors[cl]
    xy.append( ( (els[0], els[1]), (name, color)))
folium_html_stations_map(xy, width="80%")