Bike Pattern 2


We apply a bit of machine learning to Divvy data to look for a better division of Chicago: we try to identify usage patterns among bike stations.

from jyquickhelper import add_notebook_menu
add_notebook_menu()
%matplotlib inline

The data

Divvy Data publishes a sample of the data.

from pyensae.datasource import download_data
file = download_data("Divvy_Trips_2016_Q3Q4.zip", url="https://s3.amazonaws.com/divvy-data/tripdata/")

We load the stations and the trips.

import pandas
stations = pandas.read_csv("Divvy_Stations_2016_Q3.csv")
bikes = pandas.concat([pandas.read_csv("Divvy_Trips_2016_Q3.csv"),
                       pandas.read_csv("Divvy_Trips_2016_Q4.csv")])
bikes.head()
trip_id starttime stoptime bikeid tripduration from_station_id from_station_name to_station_id to_station_name usertype gender birthyear
0 12150160 9/30/2016 23:59:58 10/1/2016 00:04:03 4959 245 69 Damen Ave & Pierce Ave 17 Wood St & Division St Subscriber Male 1988.0
1 12150159 9/30/2016 23:59:58 10/1/2016 00:04:09 2589 251 383 Ashland Ave & Harrison St 320 Loomis St & Lexington St Subscriber Female 1990.0
2 12150158 9/30/2016 23:59:51 10/1/2016 00:24:51 3656 1500 302 Sheffield Ave & Wrightwood Ave 334 Lake Shore Dr & Belmont Ave Customer NaN NaN
3 12150157 9/30/2016 23:59:51 10/1/2016 00:03:56 3570 245 475 Washtenaw Ave & Lawrence Ave 471 Francisco Ave & Foster Ave Subscriber Female 1988.0
4 12150156 9/30/2016 23:59:32 10/1/2016 00:26:50 3158 1638 302 Sheffield Ave & Wrightwood Ave 492 Leavitt St & Addison St Customer NaN NaN
from datetime import datetime, time
df = bikes
df["dtstart"] = pandas.to_datetime(df.starttime, infer_datetime_format=True)
df["dtstop"] = pandas.to_datetime(df.stoptime, infer_datetime_format=True)

df["stopday"] = df.dtstop.apply(lambda r: datetime(r.year, r.month, r.day))
df["stoptime"] = df.dtstop.apply(lambda r: time(r.hour, r.minute, 0))
df["stoptime10"] = df.dtstop.apply(lambda r: time(r.hour, (r.minute // 10)*10, 0))  # every 10 minutes

df["startday"] = df.dtstart.apply(lambda r: datetime(r.year, r.month, r.day))
df["starttime"] = df.dtstart.apply(lambda r: time(r.hour, r.minute, 0))
df["starttime10"] = df.dtstart.apply(lambda r: time(r.hour, (r.minute // 10)*10, 0))  # every 10 minutes
df['stopweekday'] = df['dtstop'].dt.dayofweek
df['startweekday'] = df['dtstart'].dt.dayofweek
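The per-row ``apply`` calls above work but are slow on a few hundred thousand trips; pandas' vectorized datetime accessors give the same 10-minute binning. A sketch on a toy frame (not the author's code):

```python
import pandas as pd

# toy frame standing in for the Divvy trips
df = pd.DataFrame({"dtstart": pd.to_datetime([
    "2016-09-30 23:59:58", "2016-10-01 08:14:03"])})

# floor to the 10-minute slot, then keep only the time of day
df["starttime10"] = df["dtstart"].dt.floor("10min").dt.time
# midnight of the same day, and the day of the week (Monday=0)
df["startday"] = df["dtstart"].dt.normalize()
df["startweekday"] = df["dtstart"].dt.dayofweek

print(df["starttime10"].tolist())
```

``dt.floor("10min")`` rounds each timestamp down to its slot in one vectorized pass instead of one Python call per row.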

Normalize, aggregate and merge per start and stop time

key = ["to_station_id", "to_station_name", "stopweekday", "stoptime10"]
keep = key + ["trip_id"]

stopaggtime = df[keep].groupby(key, as_index=False).count()
stopaggtime.columns = key + ["nb_trips"]

stopaggday = df[keep[:-2] + ["trip_id"]].groupby(key[:-1], as_index=False).count()
stopaggday.columns = key[:-1] + ["nb_trips"]

stopmerge = stopaggtime.merge(stopaggday, on=key[:-1], suffixes=("", "day"))
stopmerge["stopdist"] = stopmerge["nb_trips"] / stopmerge["nb_tripsday"]
stopmerge.head()
to_station_id to_station_name stopweekday stoptime10 nb_trips nb_tripsday stopdist
0 2 Michigan Ave & Balbo Ave 0 00:10:00 2 913 0.002191
1 2 Michigan Ave & Balbo Ave 0 00:20:00 2 913 0.002191
2 2 Michigan Ave & Balbo Ave 0 00:30:00 2 913 0.002191
3 2 Michigan Ave & Balbo Ave 0 01:00:00 3 913 0.003286
4 2 Michigan Ave & Balbo Ave 0 01:10:00 2 913 0.002191
stopmerge[stopmerge["to_station_id"] == 2] \
       .plot(x="stoptime10", y="stopdist", figsize=(14,4), kind="area");
(figure: stop-time distribution for station 2 across 10-minute slots)
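The count-then-merge pattern used above can also be written with ``groupby().transform``, which computes the per-station totals without the intermediate merge. A sketch on toy data (column names are illustrative):

```python
import pandas as pd

trips = pd.DataFrame({
    "station": [2, 2, 2, 3],
    "slot": ["00:10", "00:10", "00:20", "00:10"],
    "trip_id": [10, 11, 12, 13],
})

# trips per (station, slot)
agg = trips.groupby(["station", "slot"], as_index=False)["trip_id"].count()
agg = agg.rename(columns={"trip_id": "nb_trips"})

# share of the station's total traffic that falls in each slot
agg["dist"] = agg["nb_trips"] / agg.groupby("station")["nb_trips"].transform("sum")
print(agg)
```

``transform("sum")`` broadcasts each station's total back onto its rows, so the division happens in place of the merge.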
key = ["from_station_id", "from_station_name", "startweekday", "starttime10"]
keep = key + ["trip_id"]

startaggtime = df[keep].groupby(key, as_index=False).count()
startaggtime.columns = key + ["nb_trips"]

startaggday = df[keep[:-2] + ["trip_id"]].groupby(key[:-1], as_index=False).count()
startaggday.columns = key[:-1] + ["nb_trips"]

startmerge = startaggtime.merge(startaggday, on=key[:-1], suffixes=("", "day"))
startmerge["startdist"] = startmerge["nb_trips"] / startmerge["nb_tripsday"]
startmerge.head()
from_station_id from_station_name startweekday starttime10 nb_trips nb_tripsday startdist
0 2 Michigan Ave & Balbo Ave 0 00:00:00 4 1065 0.003756
1 2 Michigan Ave & Balbo Ave 0 00:10:00 1 1065 0.000939
2 2 Michigan Ave & Balbo Ave 0 00:20:00 3 1065 0.002817
3 2 Michigan Ave & Balbo Ave 0 00:50:00 4 1065 0.003756
4 2 Michigan Ave & Balbo Ave 0 01:10:00 3 1065 0.002817
startmerge[startmerge["from_station_id"] == 2] \
         .plot(x="starttime10",  y="startdist", figsize=(14,4), kind="area");
(figure: start-time distribution for station 2 across 10-minute slots)
everything = stopmerge.merge(startmerge,
                             left_on=["to_station_id", "to_station_name", "stopweekday", "stoptime10"],
                             right_on=["from_station_id", "from_station_name","startweekday", "starttime10"],
                             suffixes=("stop", "start"),
                             how="outer")
everything.head()
to_station_id to_station_name stopweekday stoptime10 nb_tripsstop nb_tripsdaystop stopdist from_station_id from_station_name startweekday starttime10 nb_tripsstart nb_tripsdaystart startdist
0 2.0 Michigan Ave & Balbo Ave 0.0 00:10:00 2.0 913.0 0.002191 2.0 Michigan Ave & Balbo Ave 0.0 00:10:00 1.0 1065.0 0.000939
1 2.0 Michigan Ave & Balbo Ave 0.0 00:20:00 2.0 913.0 0.002191 2.0 Michigan Ave & Balbo Ave 0.0 00:20:00 3.0 1065.0 0.002817
2 2.0 Michigan Ave & Balbo Ave 0.0 00:30:00 2.0 913.0 0.002191 NaN NaN NaN NaN NaN NaN NaN
3 2.0 Michigan Ave & Balbo Ave 0.0 01:00:00 3.0 913.0 0.003286 NaN NaN NaN NaN NaN NaN NaN
4 2.0 Michigan Ave & Balbo Ave 0.0 01:10:00 2.0 913.0 0.002191 2.0 Michigan Ave & Balbo Ave 0.0 01:10:00 3.0 1065.0 0.002817
import numpy
from datetime import datetime

def bestof(x, y):
    # return x when it is a valid value (date, time, string),
    # otherwise fall back on y (handles NaN produced by the outer join)
    if isinstance(x, (datetime, time, str)):
        return x
    try:
        if x is None or isinstance(y, (datetime, time, str)) or numpy.isnan(x):
            return y
        else:
            return x
    except Exception:
        print(type(x), type(y))
        print(x, y)
        raise

bestof(datetime(2017,2,2), numpy.nan), bestof(numpy.nan, datetime(2017,2,2))
(datetime.datetime(2017, 2, 2, 0, 0), datetime.datetime(2017, 2, 2, 0, 0))
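Applying ``bestof`` row by row is slow; pandas' ``combine_first`` implements the same coalescing (take the left value, fall back on the right one when it is missing) column by column. A sketch:

```python
import numpy as np
import pandas as pd

left = pd.Series([2.0, np.nan, 5.0])
right = pd.Series([2.0, 7.0, np.nan])

# take the left value, falling back on the right one when NaN
station_id = left.combine_first(right)
print(station_id.tolist())
```

On the merged frame this would replace each ``every.apply(..., axis=1)`` call with one vectorized expression such as ``every["to_station_id"].combine_first(every["from_station_id"])``.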
every = everything.copy()
every["station_name"] = every.apply(lambda row: bestof(row["to_station_name"], row["from_station_name"]), axis=1)
every["station_id"] = every.apply(lambda row: bestof(row["to_station_id"], row["from_station_id"]), axis=1)
every["time10"] = every.apply(lambda row: bestof(row["stoptime10"], row["starttime10"]), axis=1)
every["weekday"] = every.apply(lambda row: bestof(row["stopweekday"], row["startweekday"]), axis=1)
every = every.drop(["stoptime10", "starttime10", "stopweekday", "startweekday",
                    "to_station_id", "from_station_id",
                    "to_station_name", "from_station_name"], axis=1)
every.head()
nb_tripsstop nb_tripsdaystop stopdist nb_tripsstart nb_tripsdaystart startdist station_name station_id time10 weekday
0 2.0 913.0 0.002191 1.0 1065.0 0.000939 Michigan Ave & Balbo Ave 2.0 00:10:00 0.0
1 2.0 913.0 0.002191 3.0 1065.0 0.002817 Michigan Ave & Balbo Ave 2.0 00:20:00 0.0
2 2.0 913.0 0.002191 NaN NaN NaN Michigan Ave & Balbo Ave 2.0 00:30:00 0.0
3 3.0 913.0 0.003286 NaN NaN NaN Michigan Ave & Balbo Ave 2.0 01:00:00 0.0
4 2.0 913.0 0.002191 3.0 1065.0 0.002817 Michigan Ave & Balbo Ave 2.0 01:10:00 0.0
every.shape, stopmerge.shape, startmerge.shape
((357809, 10), (298013, 7), (299700, 7))

We need vectors of equal size, which means filling NaN values with 0 and adding the time slots that are missing.

every.columns
Index(['nb_tripsstop', 'nb_tripsdaystop', 'stopdist', 'nb_tripsstart',
       'nb_tripsdaystart', 'startdist', 'station_name', 'station_id', 'time10',
       'weekday'],
      dtype='object')
keys = ['station_name', 'station_id', 'weekday', 'time10']
for c in every.columns:
    if c not in keys:
        every[c] = every[c].fillna(0)  # fillna returns a copy, assign it back
from ensae_projects.datainc.data_bikes import add_missing_time
full = df = add_missing_time(every, delay=10,  column="time10",
                             values=[c for c in every.columns if c not in keys])
full = full[['station_name', 'station_id', 'time10', 'weekday',
             'stopdist', 'startdist',
             'nb_tripsstop', 'nb_tripsdaystop', 'nb_tripsstart',
       'nb_tripsdaystart']].sort_values(keys)
full.head()
station_name station_id time10 weekday stopdist startdist nb_tripsstop nb_tripsdaystop nb_tripsstart nb_tripsdaystart
357809 2112 W Peterson Ave 456.0 00:00:00 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
357810 2112 W Peterson Ave 456.0 00:10:00 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
357811 2112 W Peterson Ave 456.0 00:20:00 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
341443 2112 W Peterson Ave 456.0 00:30:00 0.0 0.0 0.021277 0.0 0.0 1.0 47.0
357812 2112 W Peterson Ave 456.0 00:40:00 0.0 0.0 0.000000 0.0 0.0 0.0 0.0
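``add_missing_time`` comes from the *ensae_projects* helpers; with plain pandas, the missing 10-minute slots can be added by reindexing on a complete (station, slot) grid. A sketch on toy data with illustrative column names:

```python
import pandas as pd

every = pd.DataFrame({
    "station_id": [2, 2],
    "time10": ["00:00", "00:20"],
    "nb_trips": [4, 3],
})

# full grid of slots (144 slots per day in the real data)
slots = ["00:00", "00:10", "00:20"]
grid = pd.MultiIndex.from_product([every["station_id"].unique(), slots],
                                  names=["station_id", "time10"])

# reindex inserts the absent slots with 0 trips
full = (every.set_index(["station_id", "time10"])
             .reindex(grid, fill_value=0)
             .reset_index())
print(full["nb_trips"].tolist())
```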

Clustering (stop and start)

We cluster these distributions to look for patterns. Each feature vector must have the same length: 24 hours × 6 ten-minute slots = 144 values per day.

With the missing time slots filled in, the plots look much better.

df = full
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2, 1, figsize=(14,6))
df[df["station_id"] == 2].plot(x="time10", y="startdist", figsize=(14,4), kind="area", ax=ax[0])
df[df["station_id"] == 2].plot(x="time10", y="stopdist", figsize=(14,4),
                               kind="area", ax=ax[1], color="r");
(figure: completed start and stop distributions for station 2, two stacked area plots)

Let’s build the features.

features = df.pivot_table(index=["station_id", "station_name", "weekday"],
                          columns="time10", values=["startdist", "stopdist"]).reset_index()

features.head()
station_id station_name weekday startdist ... stopdist
time10 00:00:00 00:10:00 00:20:00 00:30:00 00:40:00 00:50:00 01:00:00 ... 22:20:00 22:30:00 22:40:00 22:50:00 23:00:00 23:10:00 23:20:00 23:30:00 23:40:00 23:50:00
0 2.0 Michigan Ave & Balbo Ave 0.0 0.003756 0.000939 0.002817 0.000000 0.000000 0.003756 0.000000 ... 0.004381 0.002191 0.004381 0.002191 0.004381 0.004381 0.005476 0.002191 0.000000 0.005476
1 2.0 Michigan Ave & Balbo Ave 1.0 0.000000 0.000000 0.001106 0.001106 0.001106 0.002212 0.000000 ... 0.009371 0.012048 0.006693 0.004016 0.005355 0.006693 0.002677 0.000000 0.000000 0.000000
2 2.0 Michigan Ave & Balbo Ave 2.0 0.001357 0.002714 0.000000 0.001357 0.000000 0.005427 0.000000 ... 0.002907 0.002907 0.015988 0.005814 0.001453 0.001453 0.011628 0.000000 0.000000 0.007267
3 2.0 Michigan Ave & Balbo Ave 3.0 0.000000 0.004144 0.000000 0.000000 0.002762 0.004144 0.000000 ... 0.009274 0.003091 0.003091 0.007728 0.001546 0.003091 0.009274 0.001546 0.007728 0.001546
4 2.0 Michigan Ave & Balbo Ave 4.0 0.000000 0.000000 0.000000 0.002846 0.000000 0.000000 0.000949 ... 0.008214 0.001027 0.006160 0.004107 0.015400 0.006160 0.002053 0.006160 0.007187 0.000000

5 rows × 291 columns
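``pivot_table`` is what turns each (station, weekday) pair into a single row with one column per time slot: 2 value columns × 144 slots plus the 3 index columns gives the 291 columns above. A minimal sketch on toy data:

```python
import pandas as pd

full = pd.DataFrame({
    "station_id": [2, 2, 2, 2],
    "weekday": [0, 0, 1, 1],
    "time10": ["00:00", "00:10", "00:00", "00:10"],
    "startdist": [0.1, 0.2, 0.3, 0.4],
})

# one row per (station, weekday), one column per time slot
features = full.pivot_table(index=["station_id", "weekday"],
                            columns="time10", values="startdist").reset_index()
print(features.shape)
```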

names = features.columns[3:]
len(names)
288
from sklearn.cluster import KMeans
clus = KMeans(8)
clus.fit(features[names])
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
pred = clus.predict(features[names])
set(pred)
{0, 1, 2, 3, 4, 5, 6, 7}
features["cluster"] = pred
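Eight clusters is a choice, not a result; one way to sanity-check it is to look at how the k-means inertia drops as the number of clusters grows (the elbow heuristic). A sketch on synthetic blobs, not the Divvy features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# two well-separated blobs
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])

# inertia = within-cluster sum of squares, lower is tighter
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (1, 2, 3, 4)}
# inertia drops sharply from k=1 to k=2, then flattens: the elbow is at 2
print(inertias)
```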

Let’s see what it means across days: we check whether a cluster is associated with working days or the week-end.

features[["cluster", "weekday", "station_id"]].groupby(["cluster", "weekday"]).count()
station_id
time10
cluster weekday
0 3.0 1
4.0 1
6.0 1
1 0.0 146
1.0 110
2.0 106
3.0 119
4.0 143
5.0 553
6.0 547
2 0.0 137
1.0 141
2.0 150
3.0 147
4.0 149
5.0 8
6.0 12
3 0.0 1
3.0 1
4 2.0 1
3.0 1
6.0 1
5 6.0 1
6 0.0 1
1.0 1
2.0 1
6.0 1
7 0.0 291
1.0 326
2.0 322
3.0 308
4.0 287
5.0 19
6.0 17
nb = features[["cluster", "weekday", "station_id"]].groupby(["cluster", "weekday"]).count()
nb = nb.reset_index()
nb[nb.cluster.isin([0, 3, 5, 6])].pivot(index="weekday", columns="cluster", values="station_id").plot(kind="bar");
(figure: number of stations per weekday for clusters 0, 3, 5, 6, bar chart)

Let’s draw the clusters.

centers = clus.cluster_centers_.T
import matplotlib.pyplot as plt
fig, ax = plt.subplots(centers.shape[1], 2, figsize=(10,10))
nbf = centers.shape[0] // 2
x = list(range(0,nbf))
col = 0
dec = 0
colors = ["red", "yellow", "gray", "green", "brown", "orange", "blue"]
for i in range(centers.shape[1]):
    if 2*i == centers.shape[1]:
        col += 1
        dec += centers.shape[1]
    color = colors[i%len(colors)]
    ax[2*i-dec, col].bar (x, centers[:nbf,i], width=1.0, color=color)
    ax[2*i-dec, col].set_ylabel("cluster %d - start" % i, color=color)
    ax[2*i+1-dec, col].bar (x, centers[nbf:,i], width=1.0, color=color)
    ax[2*i+1-dec, col].set_ylabel("cluster %d - stop" % i, color=color)
(figure: start and stop profiles of the 8 cluster centers, one pair of bar plots per cluster)

Four patterns emerge. The small clusters are annoying, but let’s show them on a map anyway. The largest one corresponds to the week-end.

Graph

We first need to get 7 clusters for each station, one per day of the week.

piv = features.pivot_table(index=["station_id","station_name"],
                           columns="weekday", values="cluster")
piv.head()
time10
weekday 0.0 1.0 2.0 3.0 4.0 5.0 6.0
station_id station_name
2.0 Michigan Ave & Balbo Ave 1.0 7.0 1.0 1.0 1.0 1.0 1.0
3.0 Shedd Aquarium 1.0 1.0 1.0 1.0 1.0 1.0 1.0
4.0 Burnham Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0
5.0 State St & Harrison St 7.0 7.0 7.0 7.0 7.0 1.0 1.0
6.0 Dusable Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0
piv["distincts"] = piv.apply(lambda row: len(set(row[i] for i in range(0,7))), axis=1)
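In the lambda above, every NaN counts as a distinct value because NaN != NaN, so a station with several empty days gets a high ``distincts`` count. pandas' ``nunique`` ignores NaN instead; a small sketch of the difference:

```python
import pandas as pd

# a week of cluster labels for one station, with two missing days
row = [1.0, 1.0, float("nan"), float("nan"), 7.0]

# set-based count: each NaN object is "unique" because NaN != NaN
set_count = len(set(row))
# pandas drops NaN by default
nunique_count = pd.Series(row).nunique()
print(set_count, nunique_count)
```

Which behaviour is wanted depends on the question: counting NaN as "yet another regime" flags the barely used stations, while ``nunique`` measures only the observed clusters.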

Let’s see which stations are classified into 4 clusters or more. NaN means no bike stopped at the station that day; these are mostly unused stations.

piv[piv.distincts >= 4]
time10 distincts
weekday 0.0 1.0 2.0 3.0 4.0 5.0 6.0
station_id station_name
391.0 Halsted St & 69th St 3.0 7.0 1.0 7.0 1.0 1.0 2.0 4
440.0 Lawndale Ave & 23rd St 7.0 7.0 1.0 7.0 2.0 1.0 0.0 4
557.0 Damen Ave & Garfield Blvd NaN 2.0 1.0 NaN 2.0 NaN NaN 6
558.0 Ashland Ave & Garfield Blvd NaN 1.0 1.0 1.0 1.0 2.0 NaN 4
561.0 Damen Ave & 61st St 2.0 7.0 2.0 7.0 NaN 1.0 1.0 4
562.0 Racine Ave & 61st St NaN NaN NaN NaN 7.0 NaN NaN 7
564.0 Racine Ave & 65th St 1.0 1.0 NaN NaN 7.0 1.0 7.0 4
565.0 Ashland Ave & 66th St 1.0 1.0 NaN 1.0 0.0 NaN 5.0 5
567.0 May St & 69th St 6.0 6.0 2.0 2.0 1.0 1.0 NaN 4
568.0 Normal Ave & 72nd St 1.0 1.0 7.0 NaN 7.0 1.0 4.0 4
569.0 Woodlawn Ave & 75th St 1.0 NaN 7.0 1.0 1.0 NaN 1.0 4
576.0 Greenwood Ave & 79th St 7.0 1.0 1.0 NaN 2.0 1.0 1.0 4
581.0 Commercial Ave & 83rd St 1.0 NaN 7.0 NaN NaN 1.0 1.0 5
582.0 Phillips Ave & 82nd St NaN NaN 1.0 1.0 NaN 1.0 7.0 5
586.0 MLK Jr Dr & 83rd St 1.0 2.0 6.0 7.0 7.0 7.0 2.0 4
587.0 Wabash Ave & 83rd St NaN NaN 1.0 NaN 1.0 2.0 7.0 6
588.0 South Chicago Ave & 83rd St NaN 2.0 7.0 3.0 1.0 2.0 7.0 5
593.0 Halsted St & 59th St NaN 7.0 4.0 4.0 NaN 1.0 1.0 5
pivn = piv.reset_index()
pivn.columns = [' '.join(str(_).replace(".0", "") for _ in col).strip() for col in pivn.columns.values]
pivn.head()
station_id station_name 0 1 2 3 4 5 6 distincts
0 2.0 Michigan Ave & Balbo Ave 1.0 7.0 1.0 1.0 1.0 1.0 1.0 2
1 3.0 Shedd Aquarium 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
2 4.0 Burnham Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
3 5.0 State St & Harrison St 7.0 7.0 7.0 7.0 7.0 1.0 1.0 2
4 6.0 Dusable Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1

Let’s draw a map for a week day.

data = stations.merge(pivn, left_on=["id", "name"],
                      right_on=["station_id", "station_name"], suffixes=('_s', '_c'))
data.sort_values("id").head()
id name latitude longitude dpcapacity online_date station_id station_name 0 1 2 3 4 5 6 distincts
357 2 Michigan Ave & Balbo Ave 41.872638 -87.623979 35 5/8/2015 2.0 Michigan Ave & Balbo Ave 1.0 7.0 1.0 1.0 1.0 1.0 1.0 2
456 3 Shedd Aquarium 41.867226 -87.615355 31 4/24/2015 3.0 Shedd Aquarium 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
53 4 Burnham Harbor 41.856268 -87.613348 23 5/16/2015 4.0 Burnham Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
497 5 State St & Harrison St 41.874053 -87.627716 23 6/18/2013 5.0 State St & Harrison St 7.0 7.0 7.0 7.0 7.0 1.0 1.0 2
188 6 Dusable Harbor 41.885042 -87.612795 31 4/24/2015 6.0 Dusable Harbor 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1
from ensae_projects.datainc.data_bikes import folium_html_stations_map

colors = ["red", "yellow", "gray", "green", "brown", "orange", "blue", "black"]
for i, c in enumerate(colors):
    print("Cluster {0} is {1}".format(i, c))
xy = []
for els in data.apply(lambda row: (row["latitude"], row["longitude"], row["1"], row["name"]), axis=1):
    try:
        cl = int(els[2])
    except:
        # NaN
        continue
    name = "%s c%d" % (els[3], cl)
    color = colors[cl]
    xy.append( ( (els[0], els[1]), (name, color)))
folium_html_stations_map(xy, width="80%")
Cluster 0 is red
Cluster 1 is yellow
Cluster 2 is gray
Cluster 3 is green
Cluster 4 is brown
Cluster 5 is orange
Cluster 6 is blue
Cluster 7 is black

Look at the colors close to the parks: people ride to the parks after work. Let’s now look at the week-end.

xy = []
for els in data.apply(lambda row: (row["latitude"], row["longitude"], row["5"], row["name"]), axis=1):
    try:
        cl = int(els[2])
    except:
        # NaN
        continue
    name = "%s c%d" % (els[3], cl)
    color = colors[cl]
    xy.append( ( (els[0], els[1]), (name, color)))
folium_html_stations_map(xy, width="80%")