XD blog

machine learning, python


2013-09-15 Python extensions to do machine learning

I started to compare the features of some Python extensions (the list is not exhaustive): scikit-learn, statsmodels, mlpy, MDP, PyBrain, Theano, MILK, pyMVPA, NLTK, Gensim and Orange.

The first one (scikit-learn) covers many features and its documentation is quite clear. When a model is missing, you can look into PyBrain for reinforcement learning, into Gensim for Dirichlet models (Latent Dirichlet Allocation, Hierarchical Dirichlet Process) and into NLTK for text processing (tokenization for example). For those who do not want to code, Orange would be a good option. Theano does gradient-based optimization and can run it on a GPU.
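As an illustration of the kind of workflow scikit-learn covers, here is a minimal sketch that chains a scaler and a logistic regression in a pipeline and evaluates it with cross validation. The dataset (iris), the model and the number of folds are my own choices for the example, not something taken from the comparison, and the imports follow the current scikit-learn layout (`sklearn.model_selection`), which differs from the one available in 2013.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small toy dataset shipped with scikit-learn (an arbitrary choice for the example).
X, y = load_iris(return_X_y=True)

# Pipeline: standardize the features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross validation; each score is the accuracy on one held-out fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()

print("accuracy per fold:", scores)
print("mean accuracy:", mean_accuracy)
```

Swapping the last step of the pipeline for another estimator is usually enough to try a different model with the same evaluation code.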

A couple of forums, kind of FAQ for machine learning:

It would be difficult to do machine learning without visualization tools: matplotlib and ggplot are a good way to start. We also manipulate tables, with numpy and pandas. For an interactive command line, ipython and bpython are two common options.
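To give an idea of what manipulating tables looks like, here is a short sketch with numpy and pandas. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# A small table built from numpy arrays; the column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a", "b"],
    "x": np.arange(6, dtype=float),
    "noise": rng.normal(size=6),
})

# Average of column x for each group.
means = df.groupby("group")["x"].mean()

print(df)
print(means)
```

The same DataFrame can then be handed to a plotting library, or converted to a numpy array with `df[["x", "noise"]].to_numpy()` before training a model.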

If you are looking for data, try the UC Irvine Machine Learning Repository. If you work on Windows, many of the modules presented here can be downloaded from Unofficial Windows Binaries for Python Extension Packages. That page also gives a clear view of which package is available for which Python version.

The following table summarizes where you can find which features (it may contain some errors):
scikit-learn, statsmodels, mlpy, MDP, PyBrain, Theano, MILK, pyMVPA, NLTK, Gensim, Orange
AdaBoost: yes, yes
ANOVA: yes, yes
ARMA (Time Series): yes
C4.5: yes, yes
Canonical Correlation Analysis: yes, yes
Cross Validation: yes, yes
DBSCAN: yes
Decision Trees: yes, yes, yes, yes
Deep Belief Networks: yes
Dictionary Learning: yes
Dynamic Time Warping: (yes), yes
Elastic Net: yes, yes, yes, yes
Evolution Strategies (ES): yes
Fast ICA: yes, yes
Fast/Partial PCA: yes
Feature Selection: yes, yes, yes, yes
Gaussian Mixture Model: yes, yes
Gaussian Naive Bayes: yes, yes
Genetic Algorithm: yes
Golub Classifier: yes
GPU computation: yes
Gradient Based Optimization: yes, yes
Gradient Boosted Tree: yes
Gradient Boosting Regression: yes, yes
Grid Search: yes
Hidden Markov Model with Gaussian Mixture Emissions (HMM GMM): yes, yes
Hierarchical Clustering (Ward…): yes, yes, yes, yes
Hierarchical Dirichlet Process (HDP): yes
ICA: yes, yes
Isotonic Regression: yes
KDTree: yes
Kernel Density: yes, yes
Kernel Fisher Discriminant: yes
Kernel PCA: yes, yes, yes
Kernel Regression: yes
Kernel Ridge Regression: yes
k-Means: yes, yes, yes, yes, yes
k-NN: yes, yes, yes, yes, yes
Label Spreading: yes, yes
Longest Common Subsequence (LCS)
Lasso: yes, yes
Large Linear Classification: yes
Latent Dirichlet Allocation (LDA): yes
Least Angle Regression (LARS): yes, yes, yes
Linear Discriminant Analysis (LDA): yes, yes, yes, yes
Linear Regression: yes, yes, yes, yes, yes, yes, yes
Logistic Regression: yes, yes, yes, yes, yes
Naive Bayesian Learner: yes, yes
Natural Language Processing (NLP): yes
Neural Network (NN): yes, yes, yes, yes
Non-Negative Matrix Factorization by Projected Gradient (NMF): yes, yes
Partial Least Squares (PLS): yes, yes
Partial Least Squares (SVD): yes
Particle Swarm Optimization (PSO): yes
Passive Aggressive Classification: yes
Passive Aggressive Regression: yes
Pipeline: yes, yes
Principal Component Analysis (PCA): yes, yes, yes, yes, yes
Probabilistic Principal Component Analysis (pPCA): yes, yes, yes
p-Value: yes, yes
Quadratic Discriminant Analysis (QDA): yes, yes
Random Forests: yes, yes, yes, yes
Recurrent Neural Network: yes
Regression Tree: yes, yes, yes
Reinforcement Learning: yes
Ridge Regression: yes, yes, yes, yes
ROC / Precision / Recall: yes, yes
SARSA: yes
Self Organizing Map (SOM - Kohonen): yes, yes, yes
Singular Value Decomposition (SVD): yes, yes
Sparse PCA: yes, yes
Spectral BiClustering: yes
Spectral Clustering: yes
Spectral Coclustering: yes
Spectral Regression Discriminant Analysis: yes
Support Vector Classification (SVC): yes, yes, yes
Support Vector Machine (SVM): yes, yes, yes, yes, yes, yes, yes
Support Vector Regression (SVR): yes, yes
TF-IDF: yes, yes
Wavelets: yes, yes
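
The LCS row of the table has no "yes" for any of the packages compared here. When a feature is that small, writing it in plain Python is often enough. The following is a minimal sketch of the classic dynamic programming approach; the function name `lcs` is my own and does not come from any of the packages above.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b as a list."""
    n, m = len(a), len(b)
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one subsequence.
    result = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


print("".join(lcs("AGGTAB", "GXTXAYB")))
```

The cost is proportional to `len(a) * len(b)` in both time and memory, which is fine for short strings but not for long documents.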

If the model you need is not in the previous list, you can use rpy2 to call R from Python, where you will most likely find a related package.

2014/09/03: you can also read Python Tools for Machine Learning.



Xavier Dupré