XD blog

tutorial

2013-06-24 Installing pip for Python

Once again I lost several minutes installing the pip package, which lets you install all the others. It looks simple at first sight, except that you first have to install its dependency: setuptools. And since I do not do this every day... I wasted time by being lazy, and I ended up once more on this page: Unofficial Windows Binaries for Python Extension Packages, which let me do everything in three clicks. I really need to remember this page.

2013-06-13 A template to create a Python module including Sphinx documentation and a setup

My students often struggle to debug their programs when two or three of them need to synchronize their versions. A good way to avoid wasting too much time is to use a tool that keeps track of modifications, such as GitHub. It then becomes easy to synchronize multiple versions.

However, students still need to debug the program after a synchronization. A good practice is to write unit tests. Every time you write a function, complex or easy, a unit test should be written to ensure its behaviour does not change after many modifications. But this means adding a file, spending some time to do it right, and frequently running all the unit tests. This is usually too painful when the project only lasts a couple of months. Plus, you usually commit to doing it only after you have gone through the nightmare of debugging once.
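To give an idea of how small a unit test can be, here is a minimal sketch using the standard `unittest` module. The function `mean` and the class `TestMean` are hypothetical names chosen for the illustration; they are not taken from the template.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    """Pins down the behaviour of mean() so later changes cannot alter it silently."""

    def test_simple_case(self):
        # 1 + 2 + 3 = 6, divided by 3 values
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input_is_rejected(self):
        # An empty input must raise instead of dividing by zero.
        with self.assertRaises(ValueError):
            mean([])
```

If this is saved in a file such as `test_mean.py` (a hypothetical name), running `python -m unittest test_mean` executes both tests.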

Last but not least, my students usually do not add documentation to their code. Most of the time, they do not need it because the project is too short to lose track of the modifications and too small for them not to know it completely. Another reason may be that they cannot see a compiled version of the documentation. The best way is to use Sphinx, but using it means spending at least a couple of hours (a lot more if you do it for the first time). Documentation can also be used to navigate through the program.
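As an illustration of what documented code can look like, here is a sketch of a function whose docstring uses the reStructuredText field lists that Sphinx can render. The function `moving_average` is a hypothetical example, not part of any project mentioned here.

```python
def moving_average(values, window):
    """
    Compute the moving average of a sequence of numbers.

    :param values: list of numbers
    :param window: number of consecutive values averaged together
    :return: list of averages, one per full window
    :raises ValueError: if *window* is not positive
    """
    if window <= 0:
        raise ValueError("window must be positive")
    # One average per position where a full window fits in the sequence.
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```

The docstring stays readable in the source file, and the same text can be turned into a page of the compiled documentation.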

For those reasons, I made a kind of template for a Python module. It includes an easy mechanism to add a unit test and to run it. It generates the documentation with no change, and it also generates a setup (gz, exe) with no change either. You can get it here: Pieces of codes, libraries (section Code). After you have downloaded it, a page gives the short list of instructions to tweak the template in order to make it yours: README.

2013-05-26 Processing (big) data with Hadoop

Big Data has become very popular. The concept seems very simple (use many machines to process big chunks of data), but in practice it takes a couple of hours before you are ready to run the first script on the grid. Hopefully, this article will help you save some time. Here are some directions I looked into to create and submit a map/reduce job.

Unless you are very good, there is very little chance that you will write a script without making any mistake on the first try. Every run on the grid has a cost, and accessing a distant cluster might take some time. That is why it is convenient to be able to develop a script on a local machine. I looked into several ways: Cygwin, a virtual machine with Cloudera, a virtual machine with HortonWorks, a local installation of Hadoop on Windows. As you may have understood, my laptop OS is Windows. Setting up a virtual machine is more complex, but it gives a better overview of how Hadoop works.

Here are the points I will develop:

Develop a short script in Hue and Hive on this local machine,
Install a virtual machine (VM) on a laptop,
Run this script on the grid (using Amazon AWS).

To go through all the steps, you need a machine with 30 GB free on the hard drive and at least 4 GB of memory. A 64-bit OS is better. I went through the steps with Windows 8, and the same steps work on any other OS.

Contents:

Local run with Java

Installation
Executing a script PIG with Cygwin
Executing a script PIG without Cygwin

Installation of a local server with HortonWorks
Install a virtual machine (Cloudera)

Files to download
Only for French keyboards
Upload, download files to the local grid
Install the VMWare Tools and create a shared folder
Final tweaks: change the repository
Install Python 3.3, Numpy (optional)
Install R and Rpy2

Install a virtual machine (HortonWorks)
Develop a short script

Run a pig Script through the command line
Checking job execution
Same process with Hive and Hue

Hadoop and Python
Using Amazon AWS

Open an Amazon account
Run a script PIG on Amazon

Errors you might face

Cannot retrieve repository metadata (repomd.xml)
ImportError: No module named '_sqlite3'
Compilation Error for a PIG script
Error when creating a Hive table

I'll assume you are familiar with Map/Reduce concepts and you have heard about Hadoop and PIG.
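As a reminder of the Map/Reduce idea assumed above, here is a minimal sketch in plain Python. It does not use Hadoop or PIG; it only imitates the three stages (map, group by key, reduce) on a word count, and all function names are hypothetical.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word of a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the work a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Chain the three stages on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(key, values)
                for key, values in shuffle(pairs).items())
```

On a real grid, the map and reduce steps run on many machines at once, and the grouping is handled by the framework; here everything runs in one process, which is enough to check the logic of a small job before paying for a run.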

Xavier Dupré