XD blog

blog page

cygwin


2013-07-16 Les trucs que je ne sais jamais quand j'en ai besoin

I read two blogs about stuff I never remember when I need it. I often manipulate text file and I know the linux tools are doing quite a great job about that. But I never remember the syntax. This blog post seems to be a good pointer: Useful Unix commands for data science.

The second is about mutex and lock. The second one is used to synchronize threads among a single application. The first one (mutex) is used to synchronize processes among them (but also threads as a consequence). And if you want to use mutex all the time because it is convenient, you should read this blog post first: Lock vs. Mutex.

2013-05-26 Processing (big) data with Hadoop

Big Data becomes very popular nowadays. If the concept seems very simple - use many machines to process big chunks of data -, pratically, it takes a couple of hours before being ready to run the first script on the grid. Hopefully, this article will help you saving some times. Here are some directions I looked to create and submit a job map/reduce.

Unless you are very strong, there is very little chance that you develop a script without making any mistake on the first try. Every run on the grid has a cost, plus accessing a distant cluster might take some time. That's why it is convenient to be able to develop a script on a local machine. I looked into several ways: Cygwin, a virtual machine with Cloudera, a virtual machine with HortonWorks, a local installation of Hadoop on Windows. As you may have understood, my laptop OS is Windows. Setting up a virtual machine is more complex but it gives a better overview of how Hadoop works.

Here are the points I will develop:

To go through all the steps, you need a machine with 30Gb free on your hard drive, and at least 4Gb memory. 64bit OS is better. I went through the steps with Windows 8 and it works on any other OS.

Contents:

I'll assume you are familiar with Map/Reduce concepts and you have heard about Hadoop and PIG.
more...

Xavier Dupré