# LightGBM, double, discrepancies

Discrepancies usually happen with lightgbm because its code uses double to represent the thresholds of trees, while ONNX only uses float. There is no way to fix these discrepancies unless the ONNX implementation of trees uses double.
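As a minimal sketch of the mechanism (with a hypothetical threshold value, not one taken from an actual model): casting a split threshold and a feature value to float32 can merge two values that a double comparison distinguishes, routing a sample to a different leaf.

```python
import numpy

# Hypothetical split threshold, stored as a double by lightgbm.
threshold = 0.3
# A feature value just above the threshold in double precision.
x = 0.3000000001

# Double comparison: x is strictly greater, the sample goes right.
print(x <= threshold)                                # False
# float32 comparison: both values round to the same float32,
# so the sample goes left instead.
print(numpy.float32(x) <= numpy.float32(threshold))  # True
```

One flipped comparison changes the leaf reached by a sample, which explains why the error is not a smooth rounding error but an occasional jump.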

```
from jyquickhelper import add_notebook_menu
```

```
%load_ext mlprodict
```

## Simple regression problem

Target y is multiplied by 10 to increase the absolute discrepancies. Relative discrepancies should not change much.

```
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(2000, n_features=10)
y *= 10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
```
```
min(y), max(y)
```

```
(-5645.317056441552, 5686.0775071009075)
```
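The claim about relative discrepancies can be checked with a standalone sketch, independent of any model: a round trip through float32 produces an absolute error that grows with the scale of the values, while the relative error stays bounded by float32 precision.

```python
import numpy

rng = numpy.random.RandomState(0)
y = rng.uniform(100.0, 600.0, size=1000)

for scale in (1.0, 10.0):
    ys = y * scale
    # Error introduced by rounding the values to float32 and back.
    err = numpy.abs(ys - ys.astype(numpy.float32).astype(numpy.float64))
    print(scale, err.max(), (err / ys).max())
```

The maximum absolute error is roughly ten times larger for the scaled values, while the maximum relative error stays around 6e-8 (float32 machine epsilon over two) in both cases.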

## Training a model

Let’s train several models to see how they behave.

```
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
```
```
models = [
    RandomForestRegressor(n_estimators=50, max_depth=8),
    LGBMRegressor(n_estimators=50, max_depth=8),
    XGBRegressor(n_estimators=50, max_depth=8),
]
```
```
from tqdm import tqdm
for model in tqdm(models):
    model.fit(X_train, y_train)
```

## Conversion to ONNX

We use the function to_onnx from this package to avoid the trouble of registering the converters from onnxmltools for the lightgbm and xgboost libraries.

```
from mlprodict.onnx_conv import to_onnx
import numpy

onnx_models = [to_onnx(model, X_train[:1].astype(numpy.float32), rewrite_ops=True)
               for model in models]
```
```
simple_onx = to_onnx(LGBMRegressor(n_estimators=3, max_depth=4).fit(X_train, y_train),
                     X_train[:1].astype(numpy.float32), rewrite_ops=True)
%onnxview simple_onx
```

## Discrepancies with float32

```
from onnxruntime import InferenceSession
from pandas import DataFrame

def max_discrepency(X, skl_model, onx_model):
    expected = skl_model.predict(X).ravel()

    sess = InferenceSession(onx_model.SerializeToString())
    got = sess.run(None, {'X': X})[0].ravel()

    diff = numpy.abs(got - expected).max()
    return diff

obs = []
x32 = X_test.astype(numpy.float32)
for model, onx in zip(models, onnx_models):
    diff = max_discrepency(x32, model, onx)
    obs.append(dict(name=model.__class__.__name__, max_diff=diff))

DataFrame(obs)
```
```
                    name  max_diff
0  RandomForestRegressor  0.000493
1          LGBMRegressor  0.000924
2           XGBRegressor  0.000977
```
```
DataFrame(obs).set_index("name").plot(kind="bar").set_title("onnxruntime + float32");
```

## Discrepancies with mlprodict

This is not available with the current standard ONNX specifications. It requires mlprodict, which implements a runtime for tree ensembles supporting doubles.

```
from mlprodict.onnxrt import OnnxInference
from pandas import DataFrame

def max_discrepency_2(X, skl_model, onx_model):
    expected = skl_model.predict(X).ravel()

    sess = OnnxInference(onx_model)
    got = sess.run({'X': X})['variable'].ravel()

    diff = numpy.abs(got - expected).max()
    return diff

obs = []
x32 = X_test.astype(numpy.float32)
for model, onx in zip(models, onnx_models):
    diff = max_discrepency_2(x32, model, onx)
    obs.append(dict(name=model.__class__.__name__, max_diff=diff))

DataFrame(obs)
```
```
                    name  max_diff
0  RandomForestRegressor  0.000798
1          LGBMRegressor  0.001288
2           XGBRegressor  0.000122
```
```
DataFrame(obs).set_index("name").plot(kind="bar").set_title("mlprodict + float32");
```

## Discrepancies with mlprodict and double

The conversion needs to happen again.

```
simple_onx = to_onnx(LGBMRegressor(n_estimators=2, max_depth=2).fit(X_train, y_train),
                     X_train[:1].astype(numpy.float64), rewrite_ops=True)
%onnxview simple_onx
```
```
UserWarning: Unable to find operator 'TreeEnsembleRegressorDouble' in domain 'mlprodict' in ONNX, op_version is forced to 1.
```
```
onnx_models_64 = []
for model in tqdm(models):
    onx = to_onnx(model, X_train[:1].astype(numpy.float64), rewrite_ops=True)
    onnx_models_64.append(onx)
```
```
obs64 = []
x64 = X_test.astype(numpy.float64)
for model, onx in zip(models, onnx_models_64):
    diff = max_discrepency_2(x64, model, onx)
    obs64.append(dict(name=model.__class__.__name__, max_diff=diff))

DataFrame(obs64)
```
```
                    name      max_diff
0  RandomForestRegressor  2.273737e-12
1          LGBMRegressor  4.686752e-05
2           XGBRegressor  1.562066e-03
```
```
DataFrame(obs64).set_index("name").plot(kind="bar").set_title("mlprodict + float64");
```
```
df = DataFrame(obs).set_index('name').merge(DataFrame(obs64).set_index('name'),
                                            left_index=True, right_index=True)
df.columns = ['float32', 'float64']
df
```
```
                        float32       float64
name
RandomForestRegressor  0.000798  2.273737e-12
```
```
import matplotlib.pyplot as plt