module df.dataframe_helpers

Short summary

module pandas_streaming.df.dataframe_helpers

Helpers for dataframes.

source on GitHub

Functions

function

truncated documentation

dataframe_hash_columns

Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values.

dataframe_shuffle

Shuffles a dataframe.

dataframe_unfold

One column may contain concatenated values. This function splits these values and multiplies the rows for each split …

hash_float

Hashes a float into a float.

hash_int

Hashes an integer into an integer.

hash_str

Hashes a string.

numpy_types

Returns the list of numpy available types.

pandas_fillna

Replaces NaN values with something that is not NaN. Mostly used by pandas_groupby_nan().

pandas_groupby_nan

Does a groupby that keeps missing values (NaN).

Documentation

pandas_streaming.df.dataframe_helpers.dataframe_hash_columns(df, cols=None, hash_length=10, inplace=False)[source]

Hashes a set of columns in a dataframe. Keeps the same type. Skips missing values.

Parameters
  • df – dataframe

  • cols – columns to hash, or None for all columns

  • hash_length – for strings only, length of the hash

  • inplace – if True, modifies the dataframe in place

Returns

new dataframe

This can be useful to anonymize data before making it public.

Hashes a set of columns in a dataframe

<<<

import pandas
from pandas_streaming.df import dataframe_hash_columns
df = pandas.DataFrame([dict(a=1, b="e", c=5.6, ind="a1", ai=1),
                       dict(b="f", c=5.7, ind="a2", ai=2),
                       dict(a=4, b="g", ind="a3", ai=3),
                       dict(a=8, b="h", c=5.9, ai=4),
                       dict(a=16, b="i", c=6.2, ind="a5", ai=5)])
print(df)
print('--------------')
df2 = dataframe_hash_columns(df)
print(df2)

>>>

          a  ai  b    c  ind
    0   1.0   1  e  5.6   a1
    1   NaN   2  f  5.7   a2
    2   4.0   3  g  NaN   a3
    3   8.0   4  h  5.9  NaN
    4  16.0   5  i  6.2   a5
    --------------
                  a        ai           b             c         ind
    0  4.648669e+11  65048080  3f79bb7b43  3.355454e+11  f55ff16f66
    1           NaN   1214325  252f10c836  5.803745e+11  2c3a4249d7
    2  2.750847e+11  80131111  cd0aa98561           NaN  f46dd28a54
    3  1.940968e+11  19167269  aaa9402664  9.635096e+10         NaN
    4  1.083806e+12   8788782  de7d1b721a  3.183198e+11  66220e7159
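
The column-wise behaviour described above (hash every value, leave missing values untouched) can be sketched with plain pandas and hashlib. This is only an illustration under assumptions: `sketch_hash_value` is a hypothetical helper, SHA-256 is an assumed choice of hash, and nothing here is claimed to be how dataframe_hash_columns is implemented.

```python
import hashlib

import pandas


def sketch_hash_value(value, hash_length=10):
    """Hypothetical helper: hash one value, keep `hash_length` hex characters."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return digest[:hash_length]


df = pandas.DataFrame({"b": ["e", "f", None], "ai": [1, 2, 3]})

# Work on a copy so the original dataframe is left unchanged.
hashed = df.copy()
# na_action="ignore" skips missing values instead of hashing them.
hashed["b"] = df["b"].map(sketch_hash_value, na_action="ignore")
print(hashed)
```

Working on a copy mirrors the default `inplace=False`: the input dataframe keeps its original values.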

pandas_streaming.df.dataframe_helpers.dataframe_shuffle(df, random_state=None)[source]

Shuffles a dataframe.

Parameters
  • df – dataframe

  • random_state – seed of the random number generator

Returns

new pandas.DataFrame

Shuffles the rows of a dataframe

<<<

import pandas
from pandas_streaming.df import dataframe_shuffle

df = pandas.DataFrame([dict(a=1, b="e", c=5.6, ind="a1"),
                       dict(a=2, b="f", c=5.7, ind="a2"),
                       dict(a=4, b="g", c=5.8, ind="a3"),
                       dict(a=8, b="h", c=5.9, ind="a4"),
                       dict(a=16, b="i", c=6.2, ind="a5")])
print(df)
print('----------')

shuffled = dataframe_shuffle(df, random_state=0)
print(shuffled)

>>>

        a  b    c ind
    0   1  e  5.6  a1
    1   2  f  5.7  a2
    2   4  g  5.8  a3
    3   8  h  5.9  a4
    4  16  i  6.2  a5
    ----------
        a  b    c ind
    2   4  g  5.8  a3
    0   1  e  5.6  a1
    1   2  f  5.7  a2
    3   8  h  5.9  a4
    4  16  i  6.2  a5
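
For comparison, a row shuffle can also be written with plain pandas through `DataFrame.sample`. This is a minimal sketch of a common idiom, not necessarily what dataframe_shuffle does internally, and the resulting order may differ from the output above even with the same seed.

```python
import pandas

df = pandas.DataFrame({"a": [1, 2, 4, 8, 16], "b": list("efghi")})

# frac=1 samples every row exactly once, which amounts to a shuffle;
# random_state makes the permutation reproducible.
shuffled = df.sample(frac=1, random_state=0)
print(shuffled)
```

The original index labels travel with the rows, so `shuffled.sort_index()` restores the initial order.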

pandas_streaming.df.dataframe_helpers.dataframe_unfold(df, col, new_col=None, sep=',')[source]

One column may contain concatenated values. This function splits these values and multiplies the rows for each split value.

Parameters
  • df – dataframe

  • col – column with the concatenated values (strings)

  • new_col – new column name; if None, a default name is used

  • sep – separator

Returns

a new dataframe

Unfolds a column of a dataframe.

<<<

import pandas
import numpy
from pandas_streaming.df import dataframe_unfold

df = pandas.DataFrame([dict(a=1, b="e,f"),
                       dict(a=2, b="g"),
                       dict(a=3)])
print(df)
df2 = dataframe_unfold(df, "b")
print('----------')
print(df2)

# To fold:
folded = df2.groupby('a').apply(lambda row: ','.join(row['b_unfold'].dropna())
                                if len(row['b_unfold'].dropna()) > 0 else numpy.nan)
print('----------')
print(folded)

>>>

       a    b
    0  1  e,f
    1  2    g
    2  3  NaN
    ----------
       a    b b_unfold
    0  1  e,f        e
    1  1  e,f        f
    2  2    g        g
    3  3  NaN      NaN
    ----------
    a
    1    e,f
    2      g
    3    NaN
    dtype: object
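
A similar unfolding can be obtained with plain pandas by splitting the strings and calling `DataFrame.explode` (available since pandas 0.25). This is a sketch for comparison only: it assumes a comma separator, and the column name `b_unfold` is chosen here to match the example above.

```python
import pandas

df = pandas.DataFrame({"a": [1, 2, 3], "b": ["e,f", "g", None]})

# Split each string into a list, then emit one row per list element.
# Rows where b is missing keep a single row with a missing value.
unfolded = df.assign(b_unfold=df["b"].str.split(",")).explode("b_unfold")
unfolded = unfolded.reset_index(drop=True)
print(unfolded)
```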

pandas_streaming.df.dataframe_helpers.hash_float(c, hash_length)[source]

Hashes a float into a float.

Parameters
  • c – value to hash

  • hash_length – length of the hash

Returns

float

pandas_streaming.df.dataframe_helpers.hash_int(c, hash_length)[source]

Hashes an integer into an integer.

Parameters
  • c – value to hash

  • hash_length – length of the hash

Returns

int

pandas_streaming.df.dataframe_helpers.hash_str(c, hash_length)[source]

Hashes a string.

Parameters
  • c – value to hash

  • hash_length – length of the hash

Returns

string
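
The three hash helpers above keep the type of their input: a float gives a float, an integer an integer, a string a string. The sketch below shows one way such type-preserving hashes could be written with the standard library. The `sketch_*` functions are hypothetical, SHA-256 and the truncation lengths are assumptions, and the values they return are not claimed to match those of the library.

```python
import hashlib
import math
import struct


def sketch_hash_str(value, hash_length):
    """Hash a string and keep the first `hash_length` hex characters."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return digest[:hash_length]


def sketch_hash_int(value):
    """Hash an integer into a non-negative integer (32 bits kept)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest[:8], 16)


def sketch_hash_float(value):
    """Hash a float into a float; NaN is returned unchanged."""
    if math.isnan(value):
        return value
    digest = hashlib.sha256(struct.pack("d", value)).hexdigest()
    # 13 hex characters = 52 bits, exactly representable as a float.
    return float(int(digest[:13], 16))


print(sketch_hash_str("e", 10))
print(sketch_hash_int(5))
print(sketch_hash_float(5.6))
```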

pandas_streaming.df.dataframe_helpers.numpy_types()[source]

Returns the list of numpy available types.

Returns

list of types

pandas_streaming.df.dataframe_helpers.pandas_fillna(df, by, hasna=None, suffix=None)[source]

Replaces NaN values with something that is not NaN. Mostly used by pandas_groupby_nan.

Parameters
  • df – dataframe

  • by – list of columns for which we need to replace NaN

  • hasna – None or list of columns for which we need to replace NaN

  • suffix – suffix to use for the value replacing NaN

Returns

list of values chosen for each column, new dataframe (new copy)
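
The idea of replacing NaN by a value that does not already appear in the column can be sketched as follows. `sketch_fillna` is a hypothetical helper written for illustration: it assumes string columns and builds the replacement by repeating a suffix, which is an assumption and not necessarily the strategy pandas_fillna uses.

```python
import pandas


def sketch_fillna(df, by, suffix="_"):
    """Return (chosen replacements, copy of df) with NaN replaced in `by`."""
    res = df.copy()
    chosen = {}
    for col in by:
        if not res[col].isna().any():
            continue  # nothing to replace in this column
        existing = set(res[col].dropna().astype(str))
        value = suffix
        # Grow the candidate until it cannot be confused with a real value.
        while value in existing:
            value += suffix
        chosen[col] = value
        res[col] = res[col].fillna(value)
    return chosen, res


df = pandas.DataFrame({"ind": ["a", "a", "b", None], "a": [2, 2, 3, 30]})
chosen, filled = sketch_fillna(df, ["ind"])
print(chosen)
print(filled)
```

Returning the chosen values alongside the copy makes it possible to turn the placeholders back into NaN after a groupby.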

pandas_streaming.df.dataframe_helpers.pandas_groupby_nan(df, by, axis=0, as_index=False, suffix=None, nanback=True, **kwargs)[source]

Does a groupby that keeps missing values (NaN).

Parameters
  • df – dataframe

  • by – column or list of columns

  • axis – only 0 is allowed

  • as_index – should be False

  • suffix – None or a string

  • nanback – puts NaN back in the index; otherwise it leaves the replacement for NaN (does not work when grouping by multiple columns)

  • kwargs – other parameters sent to groupby

Returns

groupby results

See groupby and missing values. If no NaN is detected, the function falls back to the regular pandas.DataFrame.groupby, which has the following behavior.

Group a dataframe by one column including nan values

The regular pandas.DataFrame.groupby removes every NaN value from the index.

<<<

from pandas import DataFrame

data = [dict(a=2, ind="a", n=1), dict(a=2, ind="a"),
        dict(a=3, ind="b"), dict(a=30)]
df = DataFrame(data)
print(df)
gr = df.groupby(["ind"]).sum()
print(gr)

>>>

        a  ind    n
    0   2    a  1.0
    1   2    a  NaN
    2   3    b  NaN
    3  30  NaN  NaN
         a    n
    ind        
    a    4  1.0
    b    3  0.0

Function pandas_groupby_nan keeps them.

<<<

from pandas import DataFrame
from pandas_streaming.df import pandas_groupby_nan

data = [dict(a=2, ind="a", n=1), dict(a=2, ind="a"),
        dict(a=3, ind="b"), dict(a=30)]
df = DataFrame(data)
gr2 = pandas_groupby_nan(df, ["ind"]).sum()
print(gr2)

>>>

       ind   a    n
    0    a   4  1.0
    1    b   3  0.0
    2  NaN  30  0.0
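
Since pandas 1.1, `DataFrame.groupby` accepts a `dropna` parameter that gives a comparable result without a helper. The sketch below is for comparison only and assumes pandas 1.1 or later; unlike the output above, the groups are returned in the index rather than in a column.

```python
import pandas
from pandas import DataFrame

data = [dict(a=2, ind="a", n=1), dict(a=2, ind="a"),
        dict(a=3, ind="b"), dict(a=30)]
df = DataFrame(data)

# dropna=False keeps the rows whose key is NaN as their own group.
gr3 = df.groupby("ind", dropna=False).sum()
print(gr3)
```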