module `df.dataframe_helpers`

Short summary

module pandas_streaming.df.dataframe_helpers

Helpers for dataframes.

Functions

function	truncated documentation
`dataframe_hash_columns`	Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values.
`dataframe_shuffle`	Shuffles a dataframe.
`dataframe_unfold`	One column may contain concatenated values. This function splits these values and multiplies the rows for each split …
`hash_float`	Hashes a float into a float.
`hash_int`	Hashes an integer into an integer.
`hash_str`	Hashes a string.
`numpy_types`	Returns the list of numpy available types.
`pandas_fillna`	Replaces the nan values for something not nan. Mostly used by `pandas_groupby_nan()`.
`pandas_groupby_nan`	Does a groupby including keeping missing values (nan).

Documentation

Helpers for dataframes.

pandas_streaming.df.dataframe_helpers.dataframe_hash_columns(df, cols=None, hash_length=10, inplace=False)

Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values.

Parameters:

df – dataframe
cols – columns to hash, or None for all columns
hash_length – for strings only, length of the hash
inplace – if True, modifies the dataframe in place

Returns:

new dataframe

This can be useful to anonymize data before making it public.

Hashes a set of columns in a dataframe

<<<

import pandas
from pandas_streaming.df import dataframe_hash_columns
df = pandas.DataFrame([dict(a=1, b="e", c=5.6, ind="a1", ai=1),
                       dict(b="f", c=5.7, ind="a2", ai=2),
                       dict(a=4, b="g", ind="a3", ai=3),
                       dict(a=8, b="h", c=5.9, ai=4),
                       dict(a=16, b="i", c=6.2, ind="a5", ai=5)])
print(df)
print('--------------')
df2 = dataframe_hash_columns(df)
print(df2)

>>>

   a  b    c  ind  ai
 1.0  e  5.6   a1   1
 NaN  f  5.7   a2   2
 4.0  g  NaN   a3   3
 8.0  h  5.9  NaN   4
16.0  i  6.2   a5   5
--------------
           a           b             c         ind        ai
4.648669e+11  3f79bb7b43  3.355454e+11  f55ff16f66  65048080
         NaN  252f10c836  5.803745e+11  2c3a4249d7   1214325
2.750847e+11  cd0aa98561           NaN  f46dd28a54  80131111
1.940968e+11  aaa9402664  9.635096e+10         NaN  19167269
1.083806e+12  de7d1b721a  3.183198e+11  66220e7159   8788782

pandas_streaming.df.dataframe_helpers.dataframe_shuffle(df, random_state=None)

Shuffles a dataframe.

Parameters:

df – pandas.DataFrame
random_state – seed

Returns:

new pandas.DataFrame

Shuffles the rows of a dataframe

<<<

import pandas
from pandas_streaming.df import dataframe_shuffle

df = pandas.DataFrame([dict(a=1, b="e", c=5.6, ind="a1"),
                       dict(a=2, b="f", c=5.7, ind="a2"),
                       dict(a=4, b="g", c=5.8, ind="a3"),
                       dict(a=8, b="h", c=5.9, ind="a4"),
                       dict(a=16, b="i", c=6.2, ind="a5")])
print(df)
print('----------')

shuffled = dataframe_shuffle(df, random_state=0)
print(shuffled)

>>>

 a  b    c ind
 1  e  5.6  a1
 2  f  5.7  a2
 4  g  5.8  a3
 8  h  5.9  a4
16  i  6.2  a5
----------
 a  b    c ind
 4  g  5.8  a3
 1  e  5.6  a1
 2  f  5.7  a2
 8  h  5.9  a4
16  i  6.2  a5
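
For comparison, a similar reproducible shuffle can be written with plain pandas. This is only a sketch of an equivalent operation, not necessarily how `dataframe_shuffle` is implemented, and the permutation it produces for a given seed may differ from the one shown above.

```python
import pandas

df = pandas.DataFrame({"a": [1, 2, 4, 8, 16], "b": list("efghi")})

# sample(frac=1) draws every row exactly once, in random order;
# random_state makes the permutation reproducible.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```

Each row keeps its own values; only the order of the rows changes.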

pandas_streaming.df.dataframe_helpers.dataframe_unfold(df, col, new_col=None, sep=',')

One column may contain concatenated values. This function splits these values and multiplies the rows for each split value.

Parameters:

df – dataframe
col – column with the concatenated values (strings)
new_col – name of the new column; if None, a default name is used (b_unfold for column b in the example below)
sep – separator

Returns:

a new dataframe

Unfolds a column of a dataframe.

<<<

import pandas
import numpy
from pandas_streaming.df import dataframe_unfold

df = pandas.DataFrame([dict(a=1, b="e,f"),
                       dict(a=2, b="g"),
                       dict(a=3)])
print(df)
df2 = dataframe_unfold(df, "b")
print('----------')
print(df2)

# To fold:
folded = df2.groupby('a').apply(lambda row: ','.join(row['b_unfold'].dropna())
                                if len(row['b_unfold'].dropna()) > 0 else numpy.nan)
print('----------')
print(folded)

>>>

       a    b
    0  1  e,f
    1  2    g
    2  3  NaN
    ----------
       a    b b_unfold
    0  1  e,f        e
    1  1  e,f        f
    2  2    g        g
    3  3  NaN      NaN
    ----------
    a
    1    e,f
    2      g
    3    NaN
    dtype: object
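
The same unfolding can be approximated with pandas alone, using `Series.str.split` followed by `DataFrame.explode`. This is a sketch of the idea, not the library's implementation; the column name `b_unfold` is chosen here to mirror the output above.

```python
import pandas

df = pandas.DataFrame([dict(a=1, b="e,f"),
                       dict(a=2, b="g"),
                       dict(a=3)])

# Split each string into a list, then emit one row per list element.
# Rows where b is missing stay as a single row with a missing value.
unfolded = (
    df.assign(b_unfold=df["b"].str.split(","))
      .explode("b_unfold")
      .reset_index(drop=True)
)
print(unfolded)
```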

pandas_streaming.df.dataframe_helpers.hash_float(c, hash_length)

Hashes a float into a float.

Parameters:

c – value to hash
hash_length – hash_length

Returns:

float

pandas_streaming.df.dataframe_helpers.hash_int(c, hash_length)

Hashes an integer into an integer.

Parameters:

c – value to hash
hash_length – hash_length

Returns:

int
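
One plausible way to hash a number while keeping its type, as `hash_float` and `hash_int` do, is to digest its binary representation and convert part of the digest back to a number. The sketch below rests on assumptions: the names `sketch_hash_float` and `sketch_hash_int`, the choice of SHA-256, and the use of `hash_length` as a number of hexadecimal digits are illustrative and may not match the library.

```python
import hashlib
import math
import struct


def sketch_hash_float(value, hash_length=10):
    """Hypothetical helper: maps a float to another float."""
    if math.isnan(value):
        return value  # missing values are left untouched
    digest = hashlib.sha256(struct.pack("d", value)).hexdigest()
    # Keep the first hash_length hex digits and read them as a number.
    return float(int(digest[:hash_length], 16))


def sketch_hash_int(value, hash_length=10):
    """Hypothetical helper: maps an integer to another integer."""
    digest = hashlib.sha256(struct.pack("q", value)).hexdigest()
    return int(digest[:hash_length], 16)


print(sketch_hash_float(5.6))
print(sketch_hash_int(4))
```

The same input always gives the same output, so joins and group-bys on the hashed column still work.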

pandas_streaming.df.dataframe_helpers.hash_str(c, hash_length)

Hashes a string.

Parameters:

c – value to hash
hash_length – length of the returned hash

Returns:

string
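
A string hash of a fixed length can be obtained by truncating a hexadecimal digest. The sketch below assumes SHA-256 and a hypothetical helper name `sketch_hash_str`; it illustrates the technique and is not presented as the library's code.

```python
import hashlib


def sketch_hash_str(value, hash_length=10):
    """Hypothetical helper: returns the first hash_length hex digits."""
    if not isinstance(value, str):
        return value  # missing values (e.g. float nan) pass through
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return digest[:hash_length]


print(sketch_hash_str("e"))
print(sketch_hash_str("e", hash_length=4))
```

A shorter `hash_length` gives more compact values at the cost of a higher risk of collisions between distinct strings.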

pandas_streaming.df.dataframe_helpers.numpy_types()

Returns the list of numpy available types.

Returns:

list of types

pandas_streaming.df.dataframe_helpers.pandas_fillna(df, by, hasna=None, suffix=None)

Replaces the nan values for something not nan. Mostly used by pandas_groupby_nan.

Parameters:

df – dataframe
by – list of columns for which we need to replace nan
hasna – None or list of columns for which we need to replace NaN
suffix – use a prefix for the NaN value

Returns:

list of values chosen for each column, new dataframe (new copy)
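
The idea of replacing nan with "something not nan" can be illustrated by choosing, for each column, a value that does not already occur in it. The sketch below is an assumption about the general technique: the helper name `sketch_fillna`, the rule for picking the replacement, and the default `suffix` value are hypothetical and not taken from the library.

```python
import pandas
from pandas.api.types import is_numeric_dtype


def sketch_fillna(df, by, suffix="_nan"):
    """Hypothetical helper: returns (chosen values, filled copy of df)."""
    res = df.copy()
    chosen = []
    for col in by:
        present = res[col].dropna()
        if is_numeric_dtype(res[col]):
            # A number larger than every existing value cannot collide.
            rep = present.max() + 1 if len(present) > 0 else 0
        else:
            # Extend the marker until it differs from every existing string.
            rep = suffix
            while rep in set(present):
                rep += suffix
        res[col] = res[col].fillna(rep)
        chosen.append(rep)
    return chosen, res


df = pandas.DataFrame([dict(a=2, ind="a"), dict(a=3, ind="b"), dict(a=30)])
chosen, filled = sketch_fillna(df, ["ind"])
print(chosen)
print(filled)
```

Returning the chosen values makes it possible to turn them back into nan after a groupby.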

pandas_streaming.df.dataframe_helpers.pandas_groupby_nan(df, by, axis=0, as_index=False, suffix=None, nanback=True, **kwargs)

Does a groupby including keeping missing values (nan).

Parameters:

df – dataframe
by – column or list of columns
axis – only 0 is allowed
as_index – should be False
suffix – None or a string
nanback – put nan back in the index, otherwise it leaves a replacement for nan. (does not work when grouping by multiple columns)
kwargs – other parameters sent to groupby

Returns:

groupby results

See groupby and missing values. If no nan is detected, the function falls back on the regular pandas.DataFrame.groupby, which has the following behavior.

Group a dataframe by one column including nan values

The regular pandas.DataFrame.groupby removes every nan value from the index.

<<<

from pandas import DataFrame

data = [dict(a=2, ind="a", n=1),
        dict(a=2, ind="a"),
        dict(a=3, ind="b"),
        dict(a=30)]
df = DataFrame(data)
print(df)
gr = df.groupby(["ind"]).sum()
print(gr)

>>>

        a  ind    n
    0   2    a  1.0
    1   2    a  NaN
    2   3    b  NaN
    3  30  NaN  NaN
         a    n
    ind        
    a    4  1.0
    b    3  0.0

Function pandas_groupby_nan keeps them.

<<<

from pandas import DataFrame
from pandas_streaming.df import pandas_groupby_nan

data = [dict(a=2, ind="a", n=1),
        dict(a=2, ind="a"),
        dict(a=3, ind="b"),
        dict(a=30)]
df = DataFrame(data)
gr2 = pandas_groupby_nan(df, ["ind"]).sum()
print(gr2)

>>>

ind   a    n
  a   4  1.0
  b   3  0.0
NaN  30  0.0
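
With pandas 1.1 or later, `DataFrame.groupby` also accepts `dropna=False`, which keeps the nan key as its own group. The sketch below uses only pandas and is shown for comparison; it does not go through `pandas_groupby_nan`.

```python
from pandas import DataFrame

data = [dict(a=2, ind="a", n=1),
        dict(a=2, ind="a"),
        dict(a=3, ind="b"),
        dict(a=30)]
df = DataFrame(data)

# dropna=False keeps rows whose key is nan instead of discarding them.
gr = df.groupby("ind", dropna=False).sum().reset_index()
print(gr)
```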
