XD blog



2015-04-06 Blog generator

I publish my teaching material as a Python module and I added some tricks to make that happen. Recently, I was wondering how to add some kind of blog posts inside the documentation. As I teach several courses, I did not want to merge all blog posts into a single blog where students would have to filter out which posts are meant for them. So I thought about using a blog generator written in Python on top of Sphinx. I went through the blog post What's the best available static blog/website generator in Python? which gives a short list of them. Their popularity can be checked by looking at Top Open-Source Static Site Generators.

I wanted my blog to follow the same design and the same pattern as the documentation. So I was looking for a tool which generates RST files and not HTML directly. Tinkerer seemed a good choice. Would I have to add the message powered by Tinkerer, as every site using it displays that sentence? I also looked into Pelican and Nikola. I also found a very simple one with a French name: éClaircie.

I finally decided to write some code to process my own blog posts and to insert them in the documentation of an existing Python module. It is not finalized yet but it looks like this: An example of a blog post included in the documentation. This process forces me to dig into the Sphinx extension development API, which I do not fully understand yet. It is quite difficult to find good examples on the web. What I have implemented is available here: pyquickhelper.helpgen.

By implementing my own blog, I cannot have all the features static generators have (good templating, many languages). I spent most of my time implementing the aggregation of blog posts (by category, by month) and the splitting into pages (no more than 10 blog posts per page). But now, if I want to customize Sphinx a little bit, it is easier.
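
The aggregation and the splitting described above can be sketched in a few lines. This is only an illustration and not the code of pyquickhelper.helpgen: the functions group_by_month and paginate are hypothetical names, and posts are assumed to be (date, title) tuples with ISO dates.

```python
from collections import defaultdict


def group_by_month(posts):
    """Groups titles by 'YYYY-MM'; posts are (date, title) tuples (assumed format)."""
    months = defaultdict(list)
    for date, title in posts:
        months[date[:7]].append(title)  # '2015-03-31' -> '2015-03'
    return dict(months)


def paginate(posts, per_page=10):
    """Splits the list of posts into pages of at most per_page posts."""
    return [posts[i:i + per_page] for i in range(0, len(posts), per_page)]


posts = [
    ("2015-04-06", "Blog generator"),
    ("2015-03-31", "GitHub, mais pourquoi ?"),
    ("2015-03-26", "Drawing in a notebook"),
    ("2015-03-07", "Work on the features or the model"),
]
by_month = group_by_month(posts)
pages = paginate(posts, per_page=3)  # 3 per page only to keep the example short
```

Each page or each month would then be written as its own RST file, so that Sphinx handles it like any other page of the documentation.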

2015-03-31 GitHub, but why?

What is GitHub? In technical terms, it is called source control software or version control software. It is useful as soon as several people work on the same files. It keeps track of every modification. The Rue89 article says a bit more about it: Qu'est-ce que tous les techos du monde font sur GitHub ? (What are all the world's techies doing on GitHub?). Today, nobody can imagine doing without it. All my teaching material is hosted there, by the way: github xavier.

Even though the tool was designed for developing software, it can track the modifications of any text, including the civil code, and of images. It works less well, and often not at all, with complex formats, especially proprietary ones.

GitHub is free for all public projects. You have to pay if you do not want to expose your sources to the public. You can also go to its competitor BitBucket, whose pricing is different. If you do not want your sources to be hosted by a third-party company at all, you can install a GitLab server yourself. And if you only want to track your modifications locally on your computer, you can install just Git, with TortoiseGit.

If you are brave, you can go as far as looking at continuous integration tools such as Travis CI or GitLab CI.

2015-03-26 Drawing in a notebook

My plan was quite simple: create a kind of small window in a notebook where I can click and mark some points. Once it is done, I retrieve the points and I run a simple algorithm to solve the Travelling Salesman Problem.

The notebook is here: Voyageur de commerce. The javascript code is generated by the following function: display_canvas_point.
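
The post does not say which algorithm the notebook runs on the clicked points. As an illustration only, here is a minimal sketch of one simple heuristic, the nearest neighbour tour: it always moves to the closest point not visited yet, and it gives a quick approximate tour, not the optimal one. The function names and the sample points are assumptions.

```python
import math


def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def nearest_neighbour_tour(points):
    """Returns the order of visit as a list of indices, starting from point 0."""
    if not points:
        return []
    remaining = list(range(1, len(points)))
    tour = [0]
    while remaining:
        last = points[tour[-1]]
        # greedy step: go to the closest point which was not visited yet
        closest = min(remaining, key=lambda i: distance(last, points[i]))
        remaining.remove(closest)
        tour.append(closest)
    return tour


def tour_length(points, tour):
    """Length of the closed tour (it comes back to the first point)."""
    return sum(distance(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))


# sample points, standing for the ones marked in the canvas
points = [(0, 0), (4, 0), (1, 0), (1, 2), (0, 2)]
tour = nearest_neighbour_tour(points)
length = tour_length(points, tour)
```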

2015-03-07 Work on the features or the model

Sometimes, a machine learned model does not get it. It does not find any way to properly classify the data. Sometimes, you know it could work better with another model but that model cannot be trained on such an amount of data. So what can be done?

Another direction consists in looking for non-linear combinations of existing features which could better explain the border between two classes. Let's consider this known difficult example:

It cannot be linearly separated but it can be with other kinds of models (k-NN, SVC). However, by adding simple multiplications between existing features, the problem becomes linear:

The point is: if you know that a complex feature would really help your model, it is worth spending time implementing it rather than trying to approximate it by using a more complex model. (corresponding notebook).
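
The idea can be checked on a small case. The sketch below is not taken from the corresponding notebook: it assumes XOR-like data (the class is the sign of x * y) and uses a plain perceptron written by hand as the linear model. With the features (x, y) the perceptron cannot separate the two classes; with the extra feature x * y it can.

```python
def train_perceptron(rows, labels, epochs=100):
    """Trains a linear classifier sign(w.x + b) with the perceptron rule."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for row, label in zip(rows, labels):
            score = bias + sum(w * v for w, v in zip(weights, row))
            if label * score <= 0:  # misclassified point: move the border
                weights = [w + label * v for w, v in zip(weights, row)]
                bias += label
                mistakes += 1
        if mistakes == 0:  # every point is on the right side
            break
    return weights, bias


def accuracy(rows, labels, weights, bias):
    """Proportion of points the linear model classifies correctly."""
    good = 0
    for row, label in zip(rows, labels):
        score = bias + sum(w * v for w, v in zip(weights, row))
        good += (1 if score > 0 else -1) == label
    return good / len(rows)


# XOR-like data (assumption): class +1 when x and y have the same sign
values = [-2, -1, 1, 2]
points = [(x, y) for x in values for y in values]
labels = [1 if x * y > 0 else -1 for x, y in points]

raw = [[x, y] for x, y in points]                  # original features
with_product = [[x, y, x * y] for x, y in points]  # plus one multiplication

acc_raw = accuracy(raw, labels, *train_perceptron(raw, labels))
acc_product = accuracy(with_product, labels,
                       *train_perceptron(with_product, labels))
print(acc_raw, acc_product)
```

The model is the same in both cases, only the features change.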

2015-03-01 Automated build of pipelines on Jenkins

Jenkins is an interesting tool. You can schedule jobs, manage dependencies between them or even display pipelines. Below is the one I use for my teaching material, which consists of many helpers: they generate documentation, provide various magic commands for IPython and test that all notebooks run fine.

2015-02-28 Automated build on Travis for a python module

Many Python modules display a small logo which indicates the build status. I set up the same for the module pyquickhelper, which is hosted on github/pyquickhelper. Travis installs packages before building the module. The first step is to gather all the dependencies:

pip freeze > requirements.txt

I replaced == by >= and removed some of them. I got:

Cython>=0.20.2
Flask>=0.10.1
Flask-SQLAlchemy>=2.0
Jinja2>=2.7.3
Markdown>=2.4.1
...
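
The replacement of == by >= can also be scripted. The following is a minimal sketch: the function relax_pins and the package name some-local-package are hypothetical, and which packages to drop remains a manual choice.

```python
def relax_pins(lines, skip=()):
    """Turns 'name==version' into 'name>=version' and drops the names in skip."""
    relaxed = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore empty lines and comments
        name = line.split("==")[0]
        if name in skip:
            continue  # dependency removed on purpose
        relaxed.append(line.replace("==", ">=", 1))
    return relaxed


# a few lines as pip freeze would produce them
frozen = [
    "Cython==0.20.2",
    "Flask==0.10.1",
    "Jinja2==2.7.3",
    "some-local-package==0.1",  # hypothetical package, not needed on Travis
]
requirements = relax_pins(frozen, skip={"some-local-package"})
print("\n".join(requirements))
```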


2015-02-26 Use scikit-learn with your own model

scikit-learn has a very simple API and it is quite simple to use its features with your own model. The model just needs to be embedded into a class which implements the methods fit, predict, decision_function, score. I wrote a simple model (kNN) which follows those guidelines: SkCustomKnn. A last method is needed for the cross validation scenario, which needs to clone the machine learned model. Cloning just calls the constructor with the proper parameters and, to do so, it needs to get a copy of them. That is the purpose of the method get_params. You are all set.
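
Here is a minimal sketch of such a class. It is not SkCustomKnn: TinyNearestNeighbour and clone_like_sklearn are hypothetical names, the model is a 1-nearest-neighbour written in pure Python, decision_function is left out to keep it short, and clone_like_sklearn only imitates the idea behind cloning (call the constructor with what get_params returns).

```python
class TinyNearestNeighbour:
    """1-nearest-neighbour classifier following the scikit-learn conventions."""

    def __init__(self, metric="manhattan"):
        # the constructor only stores the parameters
        self.metric = metric

    def get_params(self, deep=True):
        # everything needed to call the constructor again
        return {"metric": self.metric}

    def _distance(self, a, b):
        if self.metric == "manhattan":
            return sum(abs(x - y) for x, y in zip(a, b))
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def fit(self, X, y):
        # a nearest neighbour model only memorizes the training data
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        result = []
        for row in X:
            nearest = min(range(len(self.X_)),
                          key=lambda i: self._distance(row, self.X_[i]))
            result.append(self.y_[nearest])
        return result

    def score(self, X, y):
        # accuracy: proportion of correct predictions
        predicted = self.predict(X)
        return sum(p == t for p, t in zip(predicted, y)) / len(y)


def clone_like_sklearn(model):
    """Builds an untrained copy: same class, same parameters."""
    return type(model)(**model.get_params())


model = TinyNearestNeighbour(metric="euclidean").fit([[0, 0], [1, 1]], [0, 1])
copy = clone_like_sklearn(model)
```

The copy keeps the parameter metric but not the training data, which is what cross validation needs before fitting the model again on each fold.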

2015-02-21 Python distribution for Windows

The WinPython distribution now offers Python 3.4 as well as customized versions (or flavors). One of them uses Kivy. Another one is particularly interesting for a data scientist since it includes R. It is then easy to switch from Python to R within the same notebook without any additional installation step, which is easily tested with a preinstalled notebook. As the MinGW compiler is part of the distribution, Cython no longer causes any problem.

With this latest version, choosing between WinPython and Anaconda becomes difficult on Windows. One caveat: installing the module paramiko is very simple with Anaconda (with conda install) but turns out to be complicated with WinPython. So, if you need to access web resources over an encrypted connection, Anaconda probably remains the safer choice.

2015-02-16 Delay evaluation

The following class is meant to be a kind of repository of many tables. Its main issue is that it loads everything first. It takes time and might not be necessary if not all the tables are required.

import pandas


class DataContainer:
    """Repository of tables, all of them already loaded."""

    def __init__(self, big_tables):
        self.big_tables = big_tables

    def __getitem__(self, i):
        return self.big_tables[i]


filenames = ["file1.txt", "files2.txt"]


def load(filename):
    # each file is a tab-separated table
    return pandas.read_csv(filename, sep="\t")


# every table is loaded here, before any of them is requested
container = DataContainer([load(f) for f in filenames])

So the goal is to load the data only when it is required. But I would like to avoid tweaking the interface of the class, and the logic which loads the data is held outside the container. However, I would like any access to the container to trigger the loading of the data. So instead of giving the class DataContainer the data itself, I give it functions able to load the data.

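def memoize(f):
    # the cache is stored on the instance so that two containers
    # never share their loaded tables
    def helper(self, x):
        memo = self.__dict__.setdefault("_memo", {})
        if x not in memo:
            memo[x] = f(self, x)
        return memo[x]
    return helper


class DataContainerDelayed:
    def __init__(self, big_tables):
        # big_tables is a list of functions, each of them loads one table
        self.big_tables = big_tables

    @memoize
    def __getitem__(self, i):
        return self.big_tables[i]()


# t=f binds the current filename to each lambda
container = DataContainerDelayed([lambda t=f: load(t) for f in filenames])
for i in range(0, 2):
    print(container[i])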

But I would like to load the data only once. So I used a memoize mechanism.

2015-02-09 Playing Space Invaders with lines of code

If you do not believe me, go and have a look here: codingame. It is not really an arcade game: the goal is to implement a strategy which lets you beat a game without a joystick. Go and read the blog.

2015-02-05 Run an IPython notebook offline

I use notebooks intensively for my teaching and I recently noticed that some of them fail because I updated a module or I made some changes to my Python installation. So I looked for a way to run my notebooks in batch mode. I found runipy which runs a notebook and catches the exceptions it raises. After a couple of tries, I decided to modify the code to get more information when it fails. It ended up with a function run_notebook:

from pyquickhelper.ipythonhelper.notebook_helper import run_notebook

# notebook_filename, folder and outfile are paths to define beforehand
output = run_notebook(notebook_filename,
                      working_dir=folder,
                      outfilename=outfile)

I think it is going to save some time from one year to the next one.

2015-01-20 Download a file from Dropbox with Python

It is tempting to do everything from an IPython notebook, such as downloading a file from Dropbox. On the web interface, when a user clicks on a file, a Download button shows up. A second click on this button and the file is downloaded. The URL of the page which contains the button is not the one to use from a notebook, but it is close to the right one. This leads to the following example:

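import urllib.request

url = "https://www.dropbox.com/[something]/[filename]?dl=1"  # dl=1 is important
filename = "[filename]"  # name of the local copy

# download the content as bytes
with urllib.request.urlopen(url) as u:
    data = u.read()

# store the bytes into a file
with open(filename, "wb") as f:
    f.write(data)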

It first downloads the data as bytes and then stores everything into a file.

2015-01-19 Install a Python module with Wheel

Wheel is going to be the new way to install modules with Python. According to pythonwheels.com, many packages are already available and the site Unofficial Windows Binaries for Python Extension Packages already offers modules in wheel format.

Windows and Linux work the same way now. The module wheel must be installed first:

pip install wheel

The next step consists in downloading a wheel file .whl. An example with pandas:

pip install pandas-0.15.2-cp27-none-win_amd64.whl
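
The name of a wheel file tells which Python version and which platform it was built for, which helps to pick the right file to download. The sketch below splits a name according to the wheel naming convention (name, version, optional build tag, Python tag, ABI tag, platform tag); parse_wheel_filename is a hypothetical helper, not a function of pip or wheel.

```python
def parse_wheel_filename(filename):
    """Splits a wheel file name into its components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    elif len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None  # the build tag is optional
    else:
        raise ValueError("unexpected wheel file name: %r" % filename)
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,      # cp27 means CPython 2.7
        "abi": abi_tag,
        "platform": platform_tag,  # win_amd64 means 64-bit Windows
    }


info = parse_wheel_filename("pandas-0.15.2-cp27-none-win_amd64.whl")
print(info)
```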

2015-01-15 Programming projects, ENSAE 1A

List of suggested topics. Going off the beaten track is encouraged.

2014-12-21 Unit test a Flask application

It is usually quite annoying to develop a web application and to test the code by just running it and checking that it produces what you expect for a given request. It would be simpler to write a function which launches the web application on a local machine and retrieves the content of a given page. That describes what a unit test could be used for.

I used Flask to do it. I considered Bottle too but I needed to be able to shut down the application by some means. I found a way to do it for Flask faster than for Bottle. That's why I used the former.
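
The principle (launch the application locally, retrieve a page, shut the server down) can be sketched with the standard library only. This is not the code of the post: it replaces Flask's own server with wsgiref, and hello_app, QuietHandler and fetch_from_local_server are hypothetical names. As a Flask application object is a WSGI application, it could presumably be passed instead of hello_app.

```python
import threading
import urllib.request
from wsgiref.simple_server import WSGIRequestHandler, make_server


def hello_app(environ, start_response):
    """Tiny WSGI application standing for the web application to test."""
    body = b"hello from the test server"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


class QuietHandler(WSGIRequestHandler):
    def log_message(self, format, *args):
        pass  # no request log while testing


def fetch_from_local_server(app, path="/"):
    """Starts app on a free local port, retrieves one page, stops the server."""
    server = make_server("127.0.0.1", 0, app, handler_class=QuietHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d%s" % (server.server_port, path)
        # an empty ProxyHandler makes sure no proxy is used for localhost
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.status, response.read()
    finally:
        server.shutdown()      # stops serve_forever
        server.server_close()  # releases the port
        thread.join()


status, content = fetch_from_local_server(hello_app)
```

A unit test then only has to compare status and content with the expected values.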



Xavier Dupré