XD blog


scikit-learn


2019-10-05 A few benchmark experiments

I discovered, or rather someone pointed me to, the official scikit-learn benchmark, scikit-learn_benchmarks, which I run in an extended version: scikit-learn_benchmarks + ONNX. Its results are a bit moody because I run many things on the same machine. In fact, they all have their mood swings.

Then I had fun creating an automatic benchmark for every scikit-learn model, still using the asv module. Then another one... In short, everything is here: Benchmarks.

2019-04-01 Determining close leaves in a decision tree

That's a problem I had in mind yesterday. When scikit-learn builds a decision tree, we might want to know which classes share a border with one another, which I translated into: which pairs of leaves of a decision tree share a border? The parent node of two sibling leaves determines which feature splits between those two leaves, and thus between two classes. But what can we say about two leaves far apart in the tree structure? Do they share a border? We could use the training data to build a kind of Voronoï diagram on the points and group the cells which belong to the same leaf. But what if we do not have the training data?

My answer is implemented somewhere on my website. I was looking into this question while trying to imagine a way to build a continuous piecewise linear regression with at least two features... which is impossible, but finding close leaves still seemed like a good algorithmic problem.
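One way to look at the problem (a minimal sketch under my own naming, not the implementation on my website): each leaf of a decision tree is an axis-aligned box obtained by intersecting the split constraints on the path from the root, and since leaves partition the space, two leaves share a border exactly when their boxes touch along every feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_boxes(tree, n_features):
    """Map each leaf id to its bounding box, one (low, high) pair per feature."""
    t = tree.tree_
    boxes = {}

    def walk(node, bounds):
        if t.children_left[node] == -1:      # -1 marks a leaf in sklearn's tree_
            boxes[node] = [tuple(b) for b in bounds]
            return
        f, thr = t.feature[node], t.threshold[node]
        left = [list(b) for b in bounds]     # left child: x[f] <= thr
        left[f][1] = min(left[f][1], thr)
        walk(t.children_left[node], left)
        right = [list(b) for b in bounds]    # right child: x[f] > thr
        right[f][0] = max(right[f][0], thr)
        walk(t.children_right[node], right)

    walk(0, [[-np.inf, np.inf] for _ in range(n_features)])
    return boxes

def share_border(box_a, box_b):
    # Leaves never overlap in their interiors, so two boxes share a
    # border when their intervals overlap or touch on every axis.
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(box_a, box_b))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 1, 0, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
boxes = leaf_boxes(clf, X.shape[1])
leaves = sorted(boxes)
print(share_border(boxes[leaves[0]], boxes[leaves[1]]))  # True
```

The nice property of this view is that it only needs the fitted tree structure, not the training data.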

2019-03-10 Open Source

For the first time in my life, everything I do is open source and on GitHub. I remember when I left my first company: it was quite annoying to leave behind everything I had contributed to, and not to be able to look at what it became from time to time. It is like your hometown: a place you know very well and quite hard to leave forever.

I made a page for all the open source projects I work on. Most of them are my own, a couple of them contain my teaching material, some automate the publishing of the first ones, some help me in my daily life, and the last ones are Microsoft's.

2019-02-28 scikit-learn sprint

I attended my first scikit-learn sprint. I had previously posted an issue, Faster PolynomialFeatures, for which I proposed a solution: Fixes #13173, implements faster polynomial features for dense matrices. I recommend the adventure to anyone who wants to understand how a machine learning library that pleases so many people gets built. I met researchers from all kinds of backgrounds, as well as scikit-learn contributors who came to think about the library's next big challenges, which I would summarize as: how to address new use cases while preserving the simplicity of the current design?
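To see what the issue is about, PolynomialFeatures expands each row with all polynomial combinations of its features up to a given degree; the proposed fix only speeds this computation up for dense matrices. A quick illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# For degree=2 and a row (x1, x2), the output columns are
# 1, x1, x2, x1^2, x1*x2, x2^2.
X = np.array([[2., 3.]])
poly = PolynomialFeatures(degree=2)
out = poly.fit_transform(X)
print(out)  # [[1. 2. 3. 4. 6. 9.]]
```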



2016-02-21 scikit-learn, dask, map reduce, and Python examples about machine learning

This blog post is about a couple of topics. The first one is parallelizing a scikit-learn pipeline with dask: Pipelines and Reuse with dask, and a little bit more about dask: Introducing Dask Distributed. The distributed module distributes work on a local machine with a syntax very close to the map/reduce syntax. An example taken from the documentation:

from distributed import Executor

def square(x):
    return x ** 2

def neg(x):
    return -x

# connect to a scheduler running locally
executor = Executor('127.0.0.1:8786')

A = executor.map(square, range(10))   # map step: squares in parallel
B = executor.map(neg, A)              # map step: negate each result
total = executor.submit(sum, B)       # reduce step
total.result()                        # -285

I was asked where to find examples and scripts about machine learning. Most of them happen to be written in notebooks, such as the ones posted on Kaggle: Python script posted on kaggle, or examples from this blog: Yhat blog. Other examples can be found at ENSAE / Bibliography / Python for a Data Scientist. If you are looking for more code examples, pick a Kaggle competition and look for it on a search engine after adding the word "github"; you may find interesting projects. Just try kaggle github to begin with.

2015-02-26 Use scikit-learn with your own model

scikit-learn has a very simple API and it is quite easy to use its features with your own model. The model just needs to be embedded in a class which implements the methods fit, predict, decision_function and score. I wrote a simple model (kNN) which follows those guidelines: SkCustomKnn. One last method is needed for the cross validation scenario: the one which clones the trained model. Cloning just calls the constructor with the proper parameters, and to do so, it needs a copy of them. That is the purpose of the method get_params. You are all set.
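A minimal sketch of such a class (illustrative only, not the actual SkCustomKnn) could look like this:

```python
import numpy as np

class CustomKnn:
    """A toy kNN classifier exposing the interface scikit-learn expects."""

    def __init__(self, k=3):
        self.k = k  # the constructor only stores parameters

    def fit(self, X, y):
        # kNN just memorizes the training data
        self.X_, self.y_ = np.asarray(X), np.asarray(y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X):
            d = np.sum((self.X_ - x) ** 2, axis=1)          # squared distances
            nearest = self.y_[np.argsort(d)[:self.k]]       # k nearest labels
            vals, counts = np.unique(nearest, return_counts=True)
            preds.append(vals[np.argmax(counts)])           # majority vote
        return np.array(preds)

    def score(self, X, y):
        return float(np.mean(self.predict(X) == np.asarray(y)))

    def get_params(self, deep=True):
        # called by clone() during cross validation
        return {"k": self.k}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

X = np.array([[0.], [1.], [2.], [10.], [11.], [12.]])
y = np.array([0, 0, 0, 1, 1, 1])
model = CustomKnn(k=3).fit(X, y)
print(model.predict([[1.5], [11.]]))  # [0 1]
```

Because get_params and set_params are there, clone(model) works and the class can be passed directly to cross_val_score.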


Xavier Dupré