module `df.dataframe_split`

Short summary

module pandas_streaming.df.dataframe_split

Implements different methods to split a dataframe.

Functions

function	truncated documentation
`sklearn_train_test_split`	Randomly splits a dataframe into smaller pieces. The function returns streams of file names. The function relies …
`sklearn_train_test_split_streaming`	Randomly splits a dataframe into smaller pieces. The function returns streams of file names. The function relies …

Documentation

Implements different methods to split a dataframe.


pandas_streaming.df.dataframe_split.sklearn_train_test_split(self, path_or_buf=None, export_method='to_csv', names=None, **kwargs)

Randomly splits a dataframe into smaller pieces and returns streams of file names. The function relies on sklearn.model_selection.train_test_split but does not support its stratified variant.

Parameters:

self – StreamingDataFrame
path_or_buf – a string, or a list of strings or buffers; a string must contain {}, as in partition{}.txt
export_method – method used to store the partitions, by default pandas.DataFrame.to_csv
names – partitions names, by default ('train', 'test')
kwargs – parameters for the export function and sklearn.model_selection.train_test_split.

Returns:

outputs of the export functions

The function cannot return two iterators or two StreamingDataFrame objects, because iterating over one means iterating over the other. Neither split can be assumed to fit in memory, and the same iterator cannot be traversed a second time because the random draws would differ. The results must therefore be stored in files or buffers.

Warning

The method export_method must write the data in append mode and must support streaming.
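The single-pass constraint described above can be illustrated with a minimal sketch. It is not the library's implementation: it uses only the standard library, draws rows at random with `random.Random` in place of sklearn.model_selection.train_test_split, and writes to in-memory buffers instead of calling an export method. The helper `split_chunks_to_buffers`, its `pattern` and `seed` parameters, and the way `{}` is filled with the partition name are assumptions made for illustration.

```python
import io
import random


def split_chunks_to_buffers(chunks, test_size=0.25, names=("train", "test"),
                            pattern="partition{}.txt", seed=0):
    """Hypothetical helper: split a stream of chunks in a single pass.

    Every row is appended to one of two buffers as soon as it is drawn,
    so the stream never has to be traversed a second time.
    """
    rng = random.Random(seed)
    # Assumption: the {} placeholder is filled with the partition name.
    train_key, test_key = (pattern.format(name) for name in names)
    buffers = {train_key: io.StringIO(), test_key: io.StringIO()}
    for chunk in chunks:  # the stream is consumed exactly once
        for row in chunk:
            key = test_key if rng.random() < test_size else train_key
            # Appending keeps the rows written for earlier chunks.
            buffers[key].write(",".join(map(str, row)) + "\n")
    return buffers


# A generator of ten chunks of ten rows each; it can only be read once.
chunks = ([(i, i * i) for i in range(start, start + 10)]
          for start in range(0, 100, 10))
buffers = split_chunks_to_buffers(chunks)
train_lines = buffers["partitiontrain.txt"].getvalue().splitlines()
test_lines = buffers["partitiontest.txt"].getvalue().splitlines()
```

Because each row is written the moment its partition is drawn, the two outputs stay consistent with each other, which two independent iterators over the same random draws could not guarantee.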


pandas_streaming.df.dataframe_split.sklearn_train_test_split_streaming(self, test_size=0.25, train_size=None, stratify=None, hash_size=9, unique_rows=False)

Randomly splits a dataframe into a train partition and a test partition. The function relies on sklearn.model_selection.train_test_split and handles its stratified variant.

Parameters:

self – StreamingDataFrame
test_size – ratio for the test partition (if train_size is not specified)
train_size – ratio for the train partition
stratify – column holding the stratification
hash_size – size of the hash to cache information about partition
unique_rows – ensures that rows are unique

Returns:

Two StreamingDataFrame objects, one for train, one for test.

The function returns two iterators or two StreamingDataFrame objects. It tries to do everything without writing anything to disk, but it must store the assignment of rows to partitions somehow. It hashes every row and maps each hash to a partition (train or test). This cache must fit in memory, otherwise the function fails. The two returned iterators must not be consumed for the first time simultaneously, because the first pass builds the cache. The function changes the order of rows if the parameter stratify is not None. The cache has a side effect: identical rows always end up in the same partition. If that is not what you want, add an index column or a random column.
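The hash-and-cache idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: the helper `make_cached_splitter`, the use of SHA-256 on `repr(row)`, and the `seed` parameter are all hypothetical, and `hash_size` is interpreted here as the number of hexadecimal characters kept from the digest.

```python
import hashlib
import random


def make_cached_splitter(test_size=0.25, hash_size=9, seed=0):
    """Hypothetical helper: assign rows to 'train' or 'test' through a cache."""
    rng = random.Random(seed)
    cache = {}  # truncated row hash -> "train" or "test"

    def assign(row):
        digest = hashlib.sha256(repr(row).encode("utf-8")).hexdigest()
        key = digest[:hash_size]
        if key not in cache:
            # First time this row is seen: draw its partition and remember it.
            cache[key] = "test" if rng.random() < test_size else "train"
        return cache[key]

    return assign, cache


rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c")]

assign, cache = make_cached_splitter()
first_pass = [assign(row) for row in rows]   # builds the cache
second_pass = [assign(row) for row in rows]  # only reads the cache

# Adding an index column makes every row distinct, so duplicates
# are no longer forced into the same partition.
indexed_rows = [(i,) + row for i, row in enumerate(rows)]
assign_indexed, cache_indexed = make_cached_splitter()
indexed_pass = [assign_indexed(row) for row in indexed_rows]
```

The second pass reproduces the first one exactly because it only reads the cache, and the two copies of `(1, "a")` share one cache entry, which is the side effect described above.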
