module mlmodel.categories_to_integers#
Short summary#
module mlinsights.mlmodel.categories_to_integers
Implements a transformation which can be put in a pipeline to transform categories into integers.
Classes#
class | truncated documentation |
---|---|
CategoriesToIntegers | Does something similar to what DictVectorizer … |
Properties#
property | truncated documentation |
---|---|
_repr_html_ | HTML representation of estimator. This is redundant with the logic of _repr_mimebundle_. The latter should … |
Methods#
method | truncated documentation |
---|---|
__str__ | usual |
_build_schema | Concatenates all the categories given the information stored in _categories. |
fit | Makes the list of all categories in input X. X must be a dataframe. |
fit_transform | Fits and transforms categories into numerical features based on the list of categories found by method fit. … |
transform | Transforms categories into numerical features based on the list of categories found by method fit. X must … |
Documentation#
Implements a transformation which can be put in a pipeline to transform categories into integers.
- class mlinsights.mlmodel.categories_to_integers.CategoriesToIntegers(columns=None, remove=None, skip_errors=False, single=False)#
Bases: BaseEstimator, TransformerMixin
Does something similar to what DictVectorizer does, but as a transformer. The method fit retains all categories; the method transform converts categories into integers. Categories are sorted by column. If the method transform meets a category that was not seen by the method fit, it can either raise an exception or ignore the category and replace it by zero.
DictVectorizer or CategoriesToIntegers
Example which transforms text into integers:
<<<
import pandas
from mlinsights.mlmodel import CategoriesToIntegers

df = pandas.DataFrame([{"cat": "a"}, {"cat": "b"}])
trans = CategoriesToIntegers()
trans.fit(df)
newdf = trans.transform(df)
print(newdf)
>>>
   cat=a  cat=b
0    1.0    NaN
1    NaN    1.0
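The behaviour shown above can be mimicked in a few lines of plain Python. The sketch below is not the mlinsights implementation: `fit_categories` and `transform_rows` are hypothetical helpers that work on lists of dictionaries rather than dataframes, and a cell that the real output shows as NaN is simply left out of the row.

```python
def fit_categories(rows):
    """Collect the sorted list of distinct values seen in each column."""
    seen = {}
    for row in rows:
        for col, value in row.items():
            seen.setdefault(col, set()).add(value)
    return {col: sorted(values) for col, values in seen.items()}


def transform_rows(rows, categories):
    """Turn every (column, value) pair into a 'column=value' key set to 1.0."""
    out = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            if value not in categories.get(col, []):
                # This sketch always raises on a value that fit did not see.
                raise ValueError(f"unseen category {value!r} in column {col!r}")
            new_row[f"{col}={value}"] = 1.0
        out.append(new_row)
    return out


rows = [{"cat": "a"}, {"cat": "b"}]
categories = fit_categories(rows)
encoded = transform_rows(rows, categories)
print(categories)
print(encoded)
```

The naming scheme `column=value` mirrors the column names printed by the example; everything else is an illustrative choice.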
- Parameters:
columns – selection of columns to convert
remove – modalities to remove
skip_errors – skip a new category when one appears instead of raising an exception (no 1)
single – use a single column per category, do not create one column for each value
The logging function displays a message when a large dense matrix is created where a sparse one is expected. A sparse matrix should be allocated instead.
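To make the roles of skip_errors and single more concrete, here is a hedged sketch using only the standard library. The function `encode_value` is hypothetical, and two points are assumptions rather than documented behaviour: that single=True yields one integer code per column, and that this code is the rank of the value among the sorted categories.

```python
def encode_value(column, value, categories, skip_errors=False, single=False):
    """Encode one cell given the sorted categories learned for its column.

    Assumption: with single=True the cell becomes one integer code (its rank
    among the sorted categories); otherwise it becomes a 'column=value'
    indicator. An unseen value raises unless skip_errors is True, in which
    case nothing is emitted for it.
    """
    known = categories[column]
    if value not in known:
        if skip_errors:
            return {}  # no 1 is written for an unseen category
        raise ValueError(f"unseen category {value!r} in column {column!r}")
    if single:
        return {column: known.index(value)}
    return {f"{column}={value}": 1.0}


categories = {"cat": ["a", "b"]}
one_per_value = encode_value("cat", "b", categories)
one_per_column = encode_value("cat", "b", categories, single=True)
skipped = encode_value("cat", "z", categories, skip_errors=True)
```

The actual encoding produced by the class may differ; check the output of transform on a small dataframe before relying on either option.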
- __init__(columns=None, remove=None, skip_errors=False, single=False)#
- Parameters:
columns – selection of columns to convert
remove – modalities to remove
skip_errors – skip a new category when one appears instead of raising an exception (no 1)
single – use a single column per category, do not create one column for each value
The logging function displays a message when a large dense matrix is created where a sparse one is expected. A sparse matrix should be allocated instead.
- __str__()#
Usual string representation.
- _build_schema()#
Concatenates all the categories given the information stored in _categories.
- Returns:
list of columns, beginning of each
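One way to read this description is sketched below. It is an interpretation, not the library's code: `build_schema` is a hypothetical function, and reading "beginning of each" as the index at which each column's block of categories starts is an assumption.

```python
def build_schema(categories):
    """Flatten {column: sorted values} into output names plus start offsets."""
    schema = []      # flat list of 'column=value' names
    positions = {}   # column -> index of its first name in schema
    for column in sorted(categories):
        positions[column] = len(schema)
        schema.extend(f"{column}={value}" for value in categories[column])
    return schema, positions


schema, positions = build_schema({"size": ["L", "S"], "cat": ["a", "b", "c"]})
print(schema)
print(positions)
```

Keeping the start offset of each column makes it possible to locate the output position of a value from its column and its rank within that column.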
- fit(X, y=None, **fit_params)#
Makes the list of all categories in input X. X must be a dataframe.
- Parameters:
X – iterable, training data
y – iterable, default=None, training targets
- Returns:
self
- fit_transform(X, y=None, **fit_params)#
Fits and transforms categories into numerical features based on the list of categories found by method fit. X must be a dataframe. The function does not preserve the order of the columns.
- Parameters:
X – iterable, training data
y – iterable, default=None, training targets
- Returns:
DataFrame, X with categories.
- transform(X, y=None, **fit_params)#
Transforms categories into numerical features based on the list of categories found by method fit. X must be a dataframe. The function does not preserve the order of the columns.
- Parameters:
X – iterable, training data
y – iterable, default=None, training targets
- Returns:
DataFrame, X with categories.