module mlmodel.categories_to_integers


Short summary

module mlinsights.mlmodel.categories_to_integers

Implements a transformation that can be placed in a pipeline to convert categories into integers.

source on GitHub

Classes

class                   truncated documentation
CategoriesToIntegers    Does something similar to what DictVectorizer …

Methods

method           truncated documentation
__init__
__str__          usual
_build_schema    Concatenates all the categories given the information stored in _categories.
fit              Makes the list of all categories in input X. X must be a dataframe. …
fit_transform    Fits and transforms categories into numerical features based on the list of categories found by method fit. …
transform        Transforms categories into numerical features based on the list of categories found by method fit. X must …

Documentation

Implements a transformation that can be placed in a pipeline to convert categories into integers.


class mlinsights.mlmodel.categories_to_integers.CategoriesToIntegers(columns=None, remove=None, skip_errors=False, single=False)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Does something similar to what DictVectorizer does, but as a transformer. The method fit records all categories; the method transform converts categories into integers. Categories are sorted by column. If transform encounters a category that was not seen by fit, it can either raise an exception or ignore it and replace it with zero.
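The fit/transform behaviour described above can be sketched in plain Python. This is only an illustration, not the mlinsights implementation: the helper names fit_categories and transform_rows are hypothetical, rows are plain dictionaries rather than a DataFrame, and the 0/1 indicator layout is an assumption.

```python
# Illustrative sketch only: NOT the mlinsights implementation.
# fit_categories and transform_rows are hypothetical helper names.

def fit_categories(rows):
    """Collect the sorted list of categories seen in each column."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            seen.setdefault(column, set()).add(value)
    return {column: sorted(values) for column, values in seen.items()}


def transform_rows(rows, categories, skip_errors=False):
    """Encode each row as 0/1 indicators, one slot per (column, category)."""
    # Fixed output layout: columns sorted by name, categories sorted within.
    slots = [(c, v) for c in sorted(categories) for v in categories[c]]
    index = {slot: i for i, slot in enumerate(slots)}
    encoded = []
    for row in rows:
        out = [0] * len(slots)
        for column, value in row.items():
            position = index.get((column, value))
            if position is None:
                if not skip_errors:
                    raise ValueError(
                        f"unseen category {value!r} in column {column!r}")
                continue  # unseen category: its slots stay at zero
            out[position] = 1
        encoded.append(out)
    return encoded


train = [{"cat": "a"}, {"cat": "b"}]
cats = fit_categories(train)
print(cats)                         # {'cat': ['a', 'b']}
print(transform_rows(train, cats))  # [[1, 0], [0, 1]]
print(transform_rows([{"cat": "z"}], cats, skip_errors=True))  # [[0, 0]]
```

With skip_errors left at False, the last call would raise a ValueError instead of returning a row of zeros.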

DictVectorizer or CategoriesToIntegers

Example which transforms text into integers:

<<<

import pandas
from mlinsights.mlmodel import CategoriesToIntegers
df = pandas.DataFrame([{"cat": "a"}, {"cat": "b"}])
trans = CategoriesToIntegers()
trans.fit(df)
newdf = trans.transform(df)
print(newdf)

>>>

       cat=a  cat=b
    0    1.0    NaN
    1    NaN    1.0


Parameters
  • columns – specifies the columns to select

  • remove – modalities to remove

  • skip_errors – skip new categories when they appear (no 1)

  • single – use a single column per category instead of one column for each value

The logging function displays a message when a large dense matrix is created where a sparse one should have been allocated instead.
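As a rough illustration of what the single parameter might control, the sketch below contrasts one output column per value with a single integer column. It is one plausible reading of the parameter description, not the library's code: the helper names are hypothetical, None stands in for NaN, and the integer codes (positions in the sorted category list, starting at 0) are an assumption.

```python
# Hypothetical illustration of two output layouts. The names below are not
# part of mlinsights, and the integer codes are an assumption.

def expand_per_value(rows, column, categories):
    """One output column per category value, named 'column=value'."""
    return [
        {f"{column}={v}": (1.0 if row[column] == v else None)
         for v in categories}
        for row in rows
    ]


def encode_single(rows, column, categories):
    """A single output column holding the position of the category."""
    code = {v: i for i, v in enumerate(categories)}
    return [{column: code[row[column]]} for row in rows]


rows = [{"cat": "a"}, {"cat": "b"}]
categories = ["a", "b"]
wide = expand_per_value(rows, "cat", categories)
narrow = encode_single(rows, "cat", categories)
print(wide)    # one 'cat=a' and one 'cat=b' entry per row
print(narrow)  # [{'cat': 0}, {'cat': 1}]
```

The wide layout grows with the number of distinct values, which is where the dense versus sparse allocation concern becomes relevant.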


__init__(columns=None, remove=None, skip_errors=False, single=False)
Parameters
  • columns – specifies the columns to select

  • remove – modalities to remove

  • skip_errors – skip new categories when they appear (no 1)

  • single – use a single column per category instead of one column for each value

The logging function displays a message when a large dense matrix is created where a sparse one should have been allocated instead.


__str__()

Returns the usual string representation of the transformer.


_build_schema()

Concatenates all the categories given the information stored in _categories.

Returns

list of columns, beginning of each
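A minimal sketch of what such a schema could look like, assuming the stored categories map each input column to its sorted list of values. The function build_schema is a hypothetical stand-in, not the library method, and the "column=value" naming is borrowed from the documented example output.

```python
# Sketch under assumptions: `categories` maps each input column to its sorted
# category list. build_schema is a hypothetical stand-in, not the real method.

def build_schema(categories):
    """Return the flat list of output columns and where each input column starts."""
    schema = []
    starts = {}
    for column in sorted(categories):
        starts[column] = len(schema)  # index where this column's block begins
        schema.extend(f"{column}={value}" for value in categories[column])
    return schema, starts


schema, starts = build_schema({"color": ["blue", "red"], "cat": ["a", "b", "c"]})
print(schema)  # ['cat=a', 'cat=b', 'cat=c', 'color=blue', 'color=red']
print(starts)  # {'cat': 0, 'color': 3}
```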


fit(X, y=None, **fit_params)

Builds the list of all categories found in the input X. X must be a DataFrame.

Parameters
  • X (iterable) – Training data

  • y (iterable, default=None) – Training targets.

Returns

self


fit_transform(X, y=None, **fit_params)

Fits, then transforms categories into numerical features based on the list of categories found by method fit. X must be a DataFrame. The function does not preserve the order of the columns.

Parameters
  • X (iterable) – Training data

  • y (iterable, default=None) – Training targets.

Returns

DataFrame, X with categories.
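The relationship between fit, transform and fit_transform can be illustrated with a toy encoder. This is a sketch under assumptions, not the mlinsights class: TinyEncoder is hypothetical, works on a plain list instead of a DataFrame, and assumes that fit_transform is equivalent to calling fit followed by transform on the same data.

```python
# Toy transformer for illustration only. TinyEncoder is a hypothetical name,
# and the equivalence shown here is an assumption about the real class.

class TinyEncoder:
    def fit(self, values):
        # Remember the integer code of every category seen during fit.
        self.codes_ = {v: i for i, v in enumerate(sorted(set(values)))}
        return self  # returning self allows chained calls

    def transform(self, values):
        return [self.codes_[v] for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


data = ["b", "a", "b"]
chained = TinyEncoder().fit(data).transform(data)
combined = TinyEncoder().fit_transform(data)
print(chained)   # [1, 0, 1]
print(combined)  # [1, 0, 1]
```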


transform(X, y=None, **fit_params)

Transforms categories into numerical features based on the list of categories found by method fit. X must be a DataFrame. The function does not preserve the order of the columns.

Parameters
  • X (iterable) – Training data

  • y (iterable, default=None) – Training targets.

Returns

DataFrame, X with categories.
