XD blog


statistics


2014-04-02 References for Statistics with R

Python does not offer every feature in a single module; sometimes you have to look for it. In my case, I was looking for a statistical test on the coefficients obtained from a linear regression. The module I was looking for is statsmodels. While searching, I came across an interesting blog, Glowing Python. In the end I decided to switch to R, where I knew for sure I would find what I needed. And because I'm not fluent in R, I need a reference such as StatMethods.
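To make the kind of test concrete, here is a minimal sketch in plain Python of the t statistic for the slope of a simple linear regression. It is only an illustration: it is neither the statsmodels API nor what R does internally, the function name slope_t_statistic and the data are made up, and it stops at the t statistic (to be compared with a Student t distribution with n - 2 degrees of freedom) because the standard library has no t distribution to turn it into a p-value.

```python
import math


def slope_t_statistic(x, y):
    """Fit y = a + b*x by least squares.

    Returns (b, standard error of b, t statistic for the hypothesis b = 0).
    Hypothetical helper; assumes the fit is not perfect (standard error > 0).
    """
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance, with n - 2 degrees of freedom (two fitted parameters).
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = ssr / (n - 2)
    std_err = math.sqrt(sigma2 / sxx)
    return slope, std_err, slope / std_err


# Made-up data, roughly y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, std_err, t_stat = slope_t_statistic(x, y)
print(slope, std_err, t_stat)
```

A large absolute t statistic suggests the slope differs from zero; a full regression package reports the same quantity together with its p-value.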

2014-01-12 R or Python

Should you use R or Python? I won't give a precise answer beyond pointing to this blog post: Python Displacing R As The Programming Language For Data Science. To summarize, if you are a statistician, you are already using R. However, if you are not a statistician but you need statistics, you are probably wondering whether you should use R plus another language or just that other language. R is not very well designed as a programming language and is poorly suited to manipulating files, building a web server or writing games. Using Python for everything avoids switching between languages, and it avoids converting the data back and forth between their formats.

With pandas, numpy, scipy, scikit-learn, matplotlib and IPython, many common statistics routines are available in Python. Over the last two years, it has become a really strong alternative to R. In the coming years, SAS should be used less and less (see Forecast Update: Will 2014 be the Beginning of the End for SAS and SPSS?). Computing speed and memory are no longer an issue with the alternatives. Plus, SAS is expensive. I would also look at Julia (+ Julia Studio), which seems to be a promising language. I discovered it at MCMSki IV. But maybe the future belongs to dedicated languages such as BUGS for Bayesian models.

Finally, some articles about R and Python:

2014/06/30: I recommend reading Numeric matrix manipulation, The cheat sheet for MATLAB, Python NumPy, R, and Julia

2013-09-26 Busy areas in Paris

In summer, one pleasure is cycling to work. The simple option is to take a Velib, but most of the time the closest Velib station is empty. The same thing happens when you leave work to go back home: no bike is available.

I thought this could be used to draw a map of Paris showing the areas where people work. My idea was to look at how the number of available bikes is distributed over a day. I already mentioned that the Velib data is available (see Les stations Vélib à Paris un jeudi soir). The next figure shows it for a couple of stations, and one of them is clearly a workplace station: bikes arrive in the morning and disappear at the end of the working day (the data was collected a couple of weeks ago on a weekday).

The number of available bikes was measured every five minutes. Because stations do not all have the same number of docks, I normalized each curve by its sum over the day. I then considered the sum between 10am and 4pm. So for each station s, I built the following indicator, where X(s,t) is the number of bikes available at station s at time t:

I(s) = \frac{\sum_{t=\text{10am}}^{\text{4pm}} X(s,t)}{\sum_{t=\text{0am}}^{\text{11:59pm}} X(s,t)}

I used this indicator to draw a map of Paris with the Velib stations. If I(s) > 0.25, I used a red flag, and a green one otherwise.
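As a rough illustration, the indicator and the flag rule can be sketched in Python as follows. This is not the code used for the map: the function names (work_indicator, flag_color, fake_day) and the two synthetic stations are made up, and the 10am to 4pm window is assumed to exclude the 4pm sample. With that choice the window holds 72 of the 288 five-minute samples, so a station whose number of bikes never changes scores exactly 0.25 and gets a green flag.

```python
from datetime import time


def work_indicator(samples, start=time(10, 0), end=time(16, 0)):
    """Share of a station's daily bike count observed in [start, end).

    samples: list of (time of day, available bikes) for one station, one day.
    Assumption: the window is half-open, so the 4pm sample is excluded.
    """
    total = sum(bikes for _, bikes in samples)
    if total == 0:
        return 0.0  # station empty all day: no information, treat as 0
    daytime = sum(bikes for t, bikes in samples if start <= t < end)
    return daytime / total


def flag_color(indicator, threshold=0.25):
    """Red for stations that look like workplaces, green otherwise."""
    return "red" if indicator > threshold else "green"


def fake_day(bikes_at):
    """Synthetic day: one sample every five minutes, bikes_at(hour) bikes."""
    return [(time(m // 60, m % 60), bikes_at(m // 60))
            for m in range(0, 24 * 60, 5)]


# Made-up stations: bikes pile up during office hours, or leave in the morning.
office = fake_day(lambda hour: 20 if 9 <= hour < 18 else 2)
home = fake_day(lambda hour: 2 if 9 <= hour < 18 else 20)

office_score = work_indicator(office)
home_score = work_indicator(home)
print(flag_color(office_score), round(office_score, 3))
print(flag_color(home_score), round(home_score, 3))
```

Because the numerator and denominator are sums over the same station, the number of docks cancels out, which is what makes stations of different sizes comparable.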

Basically, company offices are located in the center of Paris (single-digit districts) and along the Seine, while people live around them (two-digit districts). The map also shows some business areas just outside Paris, such as Issy-Les-Moulineaux (where I work). You can play with the final result below. It uses OpenStreetMap and OpenLayers.



Xavier Dupré