.. _firststepsrst:

=================================
First steps with pandas_streaming
=================================

.. only:: html

    **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/first_steps.ipynb|*`

A few differences between *pandas* and *pandas_streaming*.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

pandas to pandas_streaming
--------------------------

.. code:: ipython3

    from pandas import DataFrame
    df = DataFrame(data=dict(X=[4.5, 6, 7], Y=["a", "b", "c"]))
    df

.. parsed-literal::

         X  Y
    0  4.5  a
    1  6.0  b
    2  7.0  c

We create a streaming dataframe:

.. code:: ipython3

    from pandas_streaming.df import StreamingDataFrame
    sdf = StreamingDataFrame.read_df(df)
    sdf

.. code:: ipython3

    sdf.to_dataframe()

.. parsed-literal::

         X  Y
    0  4.5  a
    1  6.0  b
    2  7.0  c

Internally, ``StreamingDataFrame`` implements an iterator over dataframes. It replicates the ``pandas.DataFrame`` interface wherever the data can be manipulated without loading everything into memory.

.. code:: ipython3

    sdf2 = sdf.concat(sdf)
    sdf2.to_dataframe()

.. parsed-literal::

         X  Y
    0  4.5  a
    1  6.0  b
    2  7.0  c
    0  4.5  a
    1  6.0  b
    2  7.0  c

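The idea of an iterator over dataframes can be sketched with plain pandas. The snippet below is only an illustration, not the implementation used by *pandas_streaming*: the helpers ``iter_chunks`` and ``lazy_concat`` are hypothetical names, and the chunk size of 2 is arbitrary.

.. code:: ipython3

    from itertools import chain
    from pandas import DataFrame


    def iter_chunks(df, chunksize):
        # Hypothetical helper: yield consecutive slices of at most
        # `chunksize` rows.
        for start in range(0, len(df), chunksize):
            yield df.iloc[start:start + chunksize]


    def lazy_concat(*streams):
        # Hypothetical helper: chain chunk iterators without
        # materializing them.
        return chain(*streams)


    df = DataFrame(data=dict(X=[4.5, 6, 7], Y=["a", "b", "c"]))
    stream = lazy_concat(iter_chunks(df, 2), iter_chunks(df, 2))

    # Only one chunk is held at a time while the mean is accumulated.
    total = 0.0
    n_rows = 0
    for chunk in stream:
        total += chunk["X"].sum()
        n_rows += len(chunk)

    mean_x = total / n_rows

The aggregation never needs the six rows at once, which is the property a streaming dataframe relies on.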
.. code:: ipython3

    m = DataFrame(dict(Y=["a", "b"], Z=[10, 20]))
    m

.. parsed-literal::

       Y   Z
    0  a  10
    1  b  20

.. code:: ipython3

    sdf3 = sdf2.merge(m, left_on="Y", right_on="Y", how="outer")
    sdf3.to_dataframe()

.. parsed-literal::

         X  Y     Z
    0  4.5  a  10.0
    1  6.0  b  20.0
    2  7.0  c   NaN
    0  4.5  a  10.0
    1  6.0  b  20.0
    2  7.0  c   NaN

.. code:: ipython3

    sdf2.to_dataframe().merge(m, left_on="Y", right_on="Y", how="outer")

.. parsed-literal::

         X  Y     Z
    0  4.5  a  10.0
    1  4.5  a  10.0
    2  6.0  b  20.0
    3  6.0  b  20.0
    4  7.0  c   NaN
    5  7.0  c   NaN

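One possible reading of the two results above, sketched with plain pandas: if each chunk is merged on its own and the pieces are appended, every piece keeps its own index. This is an assumption suggested by the repeated index 0, 1, 2 in the streaming output; the source does not confirm how *pandas_streaming* merges internally. The helper ``normalized`` is a hypothetical name.

.. code:: ipython3

    from pandas import DataFrame, concat

    df = DataFrame(data=dict(X=[4.5, 6, 7], Y=["a", "b", "c"]))
    m = DataFrame(dict(Y=["a", "b"], Z=[10, 20]))

    # Two chunks, mimicking the concatenated stream sdf2.
    chunks = [df, df]

    # Assumption: each chunk is merged independently, then appended.
    chunked = concat([
        chunk.merge(m, left_on="Y", right_on="Y", how="outer")
        for chunk in chunks
    ])

    # Reference: merge after loading everything into memory.
    full = concat(chunks).merge(m, left_on="Y", right_on="Y", how="outer")


    def normalized(frame):
        # Sort the rows and reset the index so only the content is compared.
        return frame.sort_values(["Y", "X"]).reset_index(drop=True)


    same_rows = normalized(chunked).equals(normalized(full))

Once the rows are sorted and the index is reset, both dataframes compare equal, so only the order and the index distinguish them.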
Both results contain the same rows, but their order and index may differ.

.. code:: ipython3

    sdftr, sdfte = sdf2.train_test_split(test_size=0.5)
    sdfte.head()

.. parsed-literal::

         X  Y
    0  4.5  a
    1  4.5  a

.. code:: ipython3

    sdftr.head()

.. parsed-literal::

         X  Y
    0  6.0  b
    1  7.0  c
    2  6.0  b
    0  7.0  c

Split a big file
----------------

.. code:: ipython3

    sdf2.to_csv("example.txt")

.. parsed-literal::

    'example.txt'

.. code:: ipython3

    new_sdf = StreamingDataFrame.read_csv("example.txt")
    new_sdf.train_test_split("example.{}.txt", streaming=False)

.. parsed-literal::

    ['example.train.txt', 'example.test.txt']

.. code:: ipython3

    import glob
    glob.glob("ex*.txt")

.. parsed-literal::

    ['example.test.txt', 'example.train.txt', 'example.txt']
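
A comparable split can be sketched with plain pandas by reading the file in chunks with ``read_csv(..., chunksize=...)``. This is a minimal illustration under stated assumptions, not necessarily what ``train_test_split`` does: the function ``split_csv_streaming``, the temporary folder, the chunk size and the seed are all hypothetical choices.

.. code:: ipython3

    import os
    import tempfile

    import numpy
    from pandas import DataFrame, read_csv


    def split_csv_streaming(path, train_path, test_path,
                            test_size=0.25, chunksize=2, seed=0):
        # Hypothetical helper: read `path` chunk by chunk and send each
        # row to the train or the test file at random.
        rng = numpy.random.default_rng(seed)
        first = True
        for chunk in read_csv(path, chunksize=chunksize):
            in_test = rng.random(len(chunk)) < test_size
            mode = "w" if first else "a"
            # The header is written once, with the first chunk.
            chunk[~in_test].to_csv(train_path, mode=mode,
                                   header=first, index=False)
            chunk[in_test].to_csv(test_path, mode=mode,
                                  header=first, index=False)
            first = False


    folder = tempfile.mkdtemp()
    source_file = os.path.join(folder, "example.txt")
    train_file = os.path.join(folder, "example.train.txt")
    test_file = os.path.join(folder, "example.test.txt")

    data = DataFrame(data=dict(X=[4.5, 6, 7, 4.5, 6, 7], Y=list("abcabc")))
    data.to_csv(source_file, index=False)
    split_csv_streaming(source_file, train_file, test_file, test_size=0.5)

    n_train = len(read_csv(train_file))
    n_test = len(read_csv(test_file))

Each row lands in exactly one of the two files, and at most ``chunksize`` rows are in memory at any time.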