When you look for a bug, the failure could be anywhere, so you make assumptions about what is working and what could be wrong. I lost a couple of hours because I made the wrong one. I was preparing my teachings and I stored my data using Python serialization instead of JSON. That way, I could just use eval( string_serialized ). It already worked locally and on a first Hadoop cluster when this instruction was embedded in a Python script that a PIG script called through streaming. I then tried the same instruction and it failed many times until I checked that this line was involved. And why? There was no error message to tell me anything about what happened. The script was just crashing. I suspect the line was just too long, so I'll do that another way.
My first assumption was that the schema I used in my Jython script was wrong. I finally chose to work around that issue by considering only strings. I then commented out line after line until removing this one finally made my job work. That's the part of computer science I don't like: long, full of guesses, impossible to accomplish without an internet connection to dig into the vast amount of recent and outdated examples. And maybe in a couple of months, this issue will be solved by a simple update.
This night never happened. That's what I'll keep in mind. This night never happened.
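To make the eval( string_serialized ) idea concrete, here is a minimal sketch of a line-by-line parser such as a streaming script could use. It is not the script from that night: the names parse_record and run are hypothetical, ast.literal_eval is swapped in for eval because it only accepts Python literals, and the error report on the error stream is an assumption about what would have helped, since the original failure produced no message at all.

```python
import ast


def parse_record(line):
    """Parse one line holding a Python literal (as written by repr())."""
    # ast.literal_eval is a stand-in for eval(): it accepts only literals
    # (dict, list, tuple, str, numbers, ...) and never runs arbitrary code.
    return ast.literal_eval(line.strip())


def run(stream_in, stream_out, stream_err):
    """Read serialized records line by line and write them back out.

    Hypothetical helper: the three arguments are file-like objects,
    typically sys.stdin, sys.stdout and sys.stderr in a streaming job.
    """
    for number, line in enumerate(stream_in, start=1):
        if not line.strip():
            continue  # skip empty lines
        try:
            record = parse_record(line)
        except (ValueError, SyntaxError) as exc:
            # Report the line number and its length instead of crashing,
            # so that an overly long or malformed line can be identified.
            stream_err.write("line %d (%d chars): %r\n"
                             % (number, len(line), exc))
            continue
        stream_out.write("%s\n" % repr(record))
```

In a streaming script, the call would be run(sys.stdin, sys.stdout, sys.stderr) after import sys. Whether the error stream shows up in the job logs depends on the cluster configuration.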
Big Data has become very popular. The concept seems simple (use many machines to process big chunks of data), but in practice it takes a couple of hours before you are ready to run your first script on the grid. Hopefully, this article will help you save some time. Here are some directions I explored to create and submit a map/reduce job.
Unless you are very good, there is very little chance that you will write a script without making a mistake on the first try. Every run on the grid has a cost, and accessing a remote cluster might take some time. That's why it is convenient to be able to develop a script on a local machine. I looked into several ways: Cygwin, a virtual machine with Cloudera, a virtual machine with HortonWorks, a local installation of Hadoop on Windows. As you may have guessed, my laptop runs Windows. Setting up a virtual machine is more complex but it gives a better overview of how Hadoop works.
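Before any of these setups is ready, the logic of a map/reduce job can already be checked in a single Python process. The sketch below is only an illustration under that assumption: it does not use Hadoop at all, the word count job and the names mapper, reducer and run_local are hypothetical, and the sort stands in for the shuffle step a cluster performs between map and reduce.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (word, 1) pair for every word of a line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts gathered for one word."""
    return word, sum(counts)


def run_local(lines):
    """Simulate map, shuffle (sort by key) and reduce in one process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    # On a cluster, the framework sorts and groups the pairs by key.
    mapped.sort(key=itemgetter(0))
    return [reducer(word, (count for _, count in group))
            for word, group in groupby(mapped, key=itemgetter(0))]
```

For instance, run_local(["big data", "Big chunks of data"]) groups the two lines into one count per word. It catches mistakes in the mapper or the reducer cheaply, but it says nothing about configuration or scale, which still need a real Hadoop installation.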
Here are the points I will develop:
Contents: