XD blog

cygwin

2013-07-16 Things I never know when I need them

I read two blog posts about things I never remember when I need them. I often manipulate text files, and I know the Linux tools do a great job at that, but I never remember the syntax. This blog post seems to be a good pointer: Useful Unix commands for data science.

The second post is about mutexes and locks. A lock is used to synchronize threads within a single application. A mutex is used to synchronize processes with each other (and, as a consequence, threads as well). If you are tempted to use a mutex all the time because it is convenient, you should read this blog post first: Lock vs. Mutex.
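To make the thread case concrete, here is a minimal sketch in Python. It is an assumption on my side that Python is a fair stand-in: the linked post discusses its own platform's lock and mutex, and `threading.Lock` is only the Python analogue of an in-process lock. The function name `count_with_lock` is made up for the illustration.

```python
import threading


def count_with_lock(n_threads=4, n_increments=10000):
    """Increment a shared counter from several threads, guarded by a lock."""
    state = {"counter": 0}
    lock = threading.Lock()  # visible to the threads of this process only

    def worker():
        for _ in range(n_increments):
            with lock:  # only one thread at a time runs this block
                state["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]


if __name__ == "__main__":
    # With the lock, the result is always n_threads * n_increments.
    print(count_with_lock())
```

This lock means nothing outside the process that created it. Synchronizing separate processes requires an object the operating system shares between them, which is the role the mutex plays in the paragraph above.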

2013-05-26 Processing (big) data with Hadoop

Big Data has become very popular. The concept seems very simple (use many machines to process big chunks of data), but in practice it takes a couple of hours before you are ready to run a first script on the grid. Hopefully, this article will save you some time. Here are the directions I explored to create and submit a map/reduce job.

Unless you are very good, there is very little chance that you will write a script without making any mistake on the first try. Every run on the grid has a cost, and accessing a remote cluster can take some time. That is why it is convenient to be able to develop a script on a local machine. I looked into several options: Cygwin, a virtual machine with Cloudera, a virtual machine with HortonWorks, and a local installation of Hadoop on Windows. As you may have guessed, my laptop runs Windows. Setting up a virtual machine is more complex, but it gives a better overview of how Hadoop works.

Here are the points I will develop:

Install a virtual machine (VM) on a laptop,
Develop a short script in Hue and Hive on this local machine,
Run this script on the grid (using Amazon AWS).

To go through all the steps, you need a machine with 30 GB free on the hard drive and at least 4 GB of memory. A 64-bit OS is better. I went through the steps on Windows 8, and they work on any other OS.

Contents:

Local run with Java

Installation
Executing a PIG script with Cygwin
Executing a PIG script without Cygwin

Installation of a local server with HortonWorks
Install a virtual machine (Cloudera)

Files to download
Only for French keyboards
Upload, download files to the local grid
Install the VMWare Tools and create a shared folder
Final tweaks: change the repository
Install Python 3.3, Numpy (optional)
Install R and Rpy2

Install a virtual machine (HortonWorks)
Develop a short script

Run a PIG script through the command line
Checking job execution
Same process with Hive and Hue

Hadoop and Python
Using Amazon AWS

Open an Amazon account
Run a PIG script on Amazon

Errors you might face

Cannot retrieve repository metadata (repomd.xml)
ImportError: No module named '_sqlite3'
Compilation error for a PIG script
Error when creating a Hive table

I'll assume you are familiar with Map/Reduce concepts and you have heard about Hadoop and PIG.
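As a quick reminder of the Map/Reduce idea, here is a minimal word-count sketch in pure Python. It only emulates the three phases (map, shuffle, reduce) in memory on a single machine. It does not use Hadoop or PIG, and all function names are made up for the illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word of a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all the values of one key into one result."""
    return key, sum(values)


def word_count(lines):
    """Chain the three steps on a small in-memory sample."""
    pairs = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big grid"]))
```

On a real grid, the map and reduce steps run on many machines and the framework handles the shuffle. A small local emulation like this one can still help check the logic on a sample before paying for a run.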

Xavier Dupré