XD blog


big data, python, sql

2014-07-19 Data can be huge, don't panic!

Data can be huge. Even if you reduce it, even if you sample it, there seems to be no end to it, and every look into it is so slow! So slow! Hundreds of millions of rows to read every time you try to find something. That's the kind of issue I ran into when I first met data from the Internet. It was almost six years ago. I realize now there might be better ideas but, back then, I used SQLite to avoid storing everything in memory, because I could not. 3 GB, even 6 GB, would not fit in my laptop's memory six years ago. However, switching from flat files to SQL tables is painful. Writing the schema is painful, at least to me. So I wrote a function which guesses it from any flat file and... well, I used some tricks, a couple of which are described here: Mix SQLite and DataFrame. Whether they are useful is totally up to you.
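To give an idea of what guessing a schema can look like, here is a minimal sketch using only the standard library. It is not the function mentioned above: the names `guess_type` and `import_flat_file`, the tab delimiter, and the 1000-row sample are all assumptions made for illustration. It reads the first rows, picks INTEGER, REAL or TEXT for each column, creates the table, then streams the remaining rows into SQLite without holding the file in memory. It assumes the file has a header line and that every row has the same number of columns.

```python
import csv
import io
import itertools
import sqlite3


def guess_type(values):
    """Return INTEGER or REAL if every non-empty value parses, else TEXT."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "TEXT"
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in non_empty:
                cast(v)
        except ValueError:
            continue
        return sql_type
    return "TEXT"


def import_flat_file(conn, table, stream, delimiter="\t", sample_size=1000):
    """Guess a schema from the first rows of `stream`, then load every row.

    Hypothetical helper: the schema is only as good as the sample, so a
    column that turns textual after `sample_size` rows is still declared
    numeric (SQLite will store the odd values as text anyway).
    """
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader)
    sample = list(itertools.islice(reader, sample_size))
    columns = list(zip(*sample)) if sample else [() for _ in header]
    types = [guess_type(col) for col in columns]

    schema = ", ".join(f'"{name}" {t}' for name, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({schema})')

    # Empty fields become NULL; SQLite's column affinity converts the rest.
    rows = (
        [v if v != "" else None for v in row]
        for row in itertools.chain(sample, reader)
    )
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return dict(zip(header, types))


# Small in-memory example; a real file would be opened with open(path, newline="").
data = io.StringIO("id\tprice\tlabel\n1\t2.5\ta\n2\t3\tb\n3\t\tc\n")
conn = sqlite3.connect(":memory:")
print(import_flat_file(conn, "sales", data))
print(conn.execute("SELECT COUNT(*), SUM(price) FROM sales").fetchone())
```

Sampling only the first rows keeps the guess fast on a huge file, at the price of occasionally being wrong. Replacing `":memory:"` with a file name keeps the table on disk, which is the whole point when the data does not fit in memory.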

It takes 10 minutes to imagine a way to deal with that mountain of data. It takes a week to build the first tool (in Python). It takes a month to cover your most painful daily tasks (still in Python, because building tools is not the only thing you have to do). It takes a year to convince people to use your tools, because they have their own habits and they don't understand why you are so convinced yours are better.

Well, do not let the machine dictate what you do. If you want to walk through 1 GB of data in a second, there is always a way; you just have to find it. And share it.


Xavier Dupré