XD blog

2014-11-27 Some really annoying things with Hadoop

When you look for a bug, it could fail anywhere, so you make assumptions about what you think is working and what could be wrong. I lost a couple of hours because I made the wrong one. While preparing my teachings, I stored my data using Python serialization instead of json, so that I could restore it with a simple eval( string_serialized ). It already worked locally and on a first Hadoop cluster, where this instruction was embedded in a Python script which a PIG script called through streaming.

I then ran the same instruction and it failed many times before I figured out that this line was involved. And why? The error message was just not there to tell me anything about what happened: the script simply crashed. I suspect the line was just too long, so I'll do it another way. My first assumption was that the schema I used in my Jython script was wrong; I finally worked around the issue by considering only strings. I commented out line after line until removing this one finally made my job work.

That's the part of computer science I don't like. Long, full of guesses, impossible to accomplish without an internet connection to dig through the vast amount of recent and outdated examples. And maybe in a couple of months, this issue will be solved by a simple update.
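To make the two serialization styles concrete, here is a minimal sketch of what I mean; the record itself is a made-up example, not my actual teaching data:

```python
import json

# A small record of the kind I was storing.
record = {"name": "student", "scores": [12.5, 15.0, 9.0]}

# Approach 1: Python serialization with repr/eval.
# It works locally, but it ties the data to Python and, in my case,
# crashed inside Hadoop streaming without any useful error message.
line_eval = repr(record)
restored_eval = eval(line_eval)

# Approach 2: json, which any language on the cluster can parse
# and which raises an explicit error when a line is malformed.
line_json = json.dumps(record)
restored_json = json.loads(line_json)

assert restored_eval == record
assert restored_json == record
```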

This night never happened. That's what I'll keep in mind. This night never happened.

2013-05-26 Processing (big) data with Hadoop

Big Data has become very popular nowadays. While the concept seems very simple - use many machines to process big chunks of data -, in practice it takes a couple of hours before you are ready to run your first script on the grid. Hopefully, this article will help you save some time. Here are some directions I explored to create and submit a map/reduce job.

Unless you are very good, there is very little chance that you develop a script without making any mistake on the first try. Every run on the grid has a cost, and accessing a distant cluster might take some time. That's why it is convenient to be able to develop a script on a local machine. I looked into several options: Cygwin, a virtual machine with Cloudera, a virtual machine with HortonWorks, and a local installation of Hadoop on Windows. As you may have guessed, my laptop runs Windows. Setting up a virtual machine is more complex, but it gives a better overview of how Hadoop works.

Here are the points I will develop:

To go through all the steps, you need a machine with 30 GB free on your hard drive and at least 4 GB of memory; a 64-bit OS is better. I went through the steps with Windows 8, but they work on any other OS.

Contents:

I'll assume you are familiar with Map/Reduce concepts and that you have heard about Hadoop and PIG.
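As a quick reminder of those concepts, here is a minimal word-count sketch in Python; the driver at the bottom simulates locally what Hadoop streaming does (read lines, sort mapper output by key, reduce), so you can check the logic before paying for a run on the grid:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Emit (word, 1) pairs, one per token, as a streaming mapper
    # would write them to stdout.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase;
    # groupby then sums the counts for each word.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["big data big cluster", "data data"]
shuffled = sorted(mapper(lines))   # simulates the shuffle/sort step
counts = dict(reducer(shuffled))
print(counts)  # {'big': 2, 'cluster': 1, 'data': 3}
```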

Xavier Dupré