XD blog

pandas, python

2017-11-05 streaming pandas dataframe

pandas is widely used by data scientists. It is one of the modules that contributed the most to making Python a good ecosystem for manipulating data. It is not perfect: on average, a dataset takes three times more space in memory than on disk, and reading a couple of gigabytes is not necessarily fast. However, a couple of gigabytes is not enough to justify heavier approaches such as parallelization (dask, ...), which bring a bit of overhead for data of that size. All I wanted was the same functionality as pandas but implemented in a streaming way: no need to load the whole dataset in memory, no need to wait for the data to be fully loaded. That's why I started pandas_streaming.

import pandas
from pandas_streaming.df import StreamingDataFrame

df = pandas.DataFrame([dict(cf=0, cint=0, cstr="0"),
                       dict(cf=1, cint=1, cstr="1"),
                       dict(cf=3, cint=3, cstr="3")])

# wrap the dataframe into a streaming dataframe
sdf = StreamingDataFrame.read_df(df)

for chunk in sdf:
    # process this chunk of data,
    # chunk is a regular pandas dataframe
    print(chunk.shape)
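
To give an idea of what processing data in a streaming way means, here is a minimal sketch that relies only on plain pandas, not on pandas_streaming. It uses the chunksize parameter of pandas.read_csv, which returns an iterator of dataframes instead of a single one. The function streaming_mean and the tiny CSV sample are hypothetical and only there for illustration. This is not how pandas_streaming is implemented.

```python
import io
import pandas


def streaming_mean(csv_source, column, chunksize=2):
    """Computes the mean of a column without loading the whole file.

    Hypothetical helper, written for illustration only.
    """
    total, count = 0.0, 0
    # with chunksize set, read_csv yields dataframes of at most
    # `chunksize` rows instead of returning one big dataframe
    for chunk in pandas.read_csv(csv_source, chunksize=chunksize):
        total += chunk[column].sum()
        count += len(chunk)
    # an empty source has no mean
    return total / count if count else float("nan")


# a small in-memory CSV stands in for a file of a couple of gigabytes
csv_text = "cf,cint,cstr\n0,0,a\n1,1,b\n3,3,c\n"
mean_cf = streaming_mean(io.StringIO(csv_text), "cf")
print(mean_cf)
```

Only one chunk lives in memory at a time, so memory usage depends on chunksize and not on the size of the file. The price to pay is that every computation must be rewritten as an accumulation over chunks.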

The module will continue to grow, though probably not as fast as I would like.

Xavier Dupré