J'ai encore perdu des minutes pour installer le package pip qui permet d'installer tous les autres. Ca a l'air simple vu d'un premier abord sauf qu'il faut installer ses dépendances : setuptools. Et comme je ne fais pas ça tous les jours... J'ai perdu du temps à force d'être paresseux et je suis retombé sur cette page : Unofficial Windows Binaries for Python Extension Packages qui m'a permis de tout faire en trois clics. Il faut vraiment que je retienne cette page.
My students often struggle to debug their programs when two or three students need to synchronize their versions. A good way to avoid wasting too much time is to use a tool to keep track of the modifications such as github. It then becomes easy to synchronize multiple versions.
However, students still need to debug the program after a synchronization. A good practice is to write unit tests. Every time, you write a complex function or an easy one, a unit test should be written to ensure its behaviour will not change after many changes. But it means to add a file, to spend some time to do it right, and to frequently run all the unit tests. This is usually too painful when the project will only last a couple of months. Plus, you usually commit yourself to do it only after you went through the nightmare of debugging once.
Last but not least, my students usually do not add documentation to their code. Most of the time, they do not need it because the project is too short to lose track of the modifications and too small to not know it completely. Maybe another reason is because they cannot see a compiled version of the documentation. The best way is to use Sphinx ut using it means spending a couple of hours at least (a lot of more if you do it for the first time). Documentation can also be used to navigate through the program.
For those reasons, I made a kind of template for a Python module. It includes an easy mechanism to add a unit test and to run it. It generates with the documentation with no change and it also generated a setup (gz, exe) with no change either. You can get it here: Pieces of codes, libraries (section Code). After you downloaded it, a page gives the short list of instructions to tweak the template in order to make it yours: README.
Big Data becomes very popular nowadays. If the concept seems very simple - use many machines to process big chunks of data -, pratically, it takes a couple of hours before being ready to run the first script on the grid. Hopefully, this article will help you saving some times. Here are some directions I looked to create and submit a job map/reduce.
Unless you are very strong, there is very little chance that you develop a script without making any mistake on the first try. Every run on the grid has a cost, plus accessing a distant cluster might take some time. That's why it is convenient to be able to develop a script on a local machine. I looked into several ways: Cygwin, a virtual machine with Cloudera, a virtual machine with HortonWorks, a local installation of Hadoop on Windows. As you may have understood, my laptop OS is Windows. Setting up a virtual machine is more complex but it gives a better overview of how Hadoop works.
Here are the points I will develop:
Contents: