.. _skewdatareducerst:

================
Reduce skew data
================

.. only:: html

    **Links:** :download:`notebook `, :downloadlink:`html `,
    :download:`PDF `, :download:`python `, :downloadlink:`slides `,
    :githublink:`GitHub|_doc/notebooks/expose/skewdata_reduce.ipynb|*`

Map/Reduce is often used to compute statistics on words. For example,
you may want to study the sentences from Wikipedia containing a specific
word, or, let's say, compute the length distribution of all sentences
including a specific word :math:`w`. The map/reduce job would look like
this:

- consider the data set A which contains all the pages from Wikipedia
- split every page into sentences
- split every sentence into words, emit two columns: word,
  length(sentence)
- reduce on word, compute the distribution of the sentence length for
  each word, emit two columns: word, histogram of sentence length

Globally, the total cost of the job is :math:`O(N_w)` where :math:`N_w`
is the total number of words. Wikipedia might contain several billion
words, yet that process could easily run on a single machine.

However, let's assume now that, instead of computing the distribution of
sentence lengths, we would like to apply a much heavier process. For
example, we could consider the set of all sentences including a word
:math:`w` and try to guess the different meanings of this word. We could
try to guess the number of different `meanings of a word `__ based on a
clustering of the co-occurrence matrix: :math:`M_w(w_i,w_j)=` *number of
times the words* :math:`w,w_i,w_j` *appear in the same sentence*. Let's
consider the following job:

- consider the data set A which contains columns :math:`w, sentence`
- for all rows, for all pairs of words :math:`(w_1 \neq w_2)`, emit
  :math:`w,w_1,w_2`
- reduce on :math:`w,w_1,w_2`, emit :math:`w,w_1,w_2,n` where :math:`n`
  is the number of rows sharing the same triplet :math:`w,w_1,w_2`
- reduce on :math:`w`, cluster the matrix built from all rows sharing
  the same :math:`w`, emit the most common word of each cluster

The last step is the longest one. The goal is not to implement a
clustering algorithm but to study what happens when the data set is
skewed. To summarize, the last step applies a complex method of cost
:math:`O(f(n))` for each word :math:`w`, where :math:`n` is related to
the number of sentences :math:`n(w)` containing that word. The total
cost of the last step is proportional to :math:`C=\sum_w f(n(w))`. We
also define :math:`N=\sum_w n(w)`, which is simply the number of
observations in the data set.

When dealing with words, we often observe that count distributions
follow a random law close to a `Zipf's law `__. In other terms, it means
that:

.. math:: P(n(w)=i)=\frac{K}{i^s} \text{ with } K^{-1}=\sum_i i^{-s} \text{ and } s>1

This way, :math:`N \propto K\sum_i i^{1-s}`. The Zipf's law is just an
assumption about how these observations are aggregated against the key
used to reduce the data set. Let's see an example:

.. code:: ipython3

    %matplotlib inline

.. code:: ipython3

    from numpy.random import zipf
    from numpy import histogram
    import matplotlib.pyplot as plt

    v = zipf(2, 20)
    print(v)
    v = zipf(2, 200)
    hist = histogram(v, bins=range(1, max(v) + 1))
    plt.bar(hist[1][:-1], hist[0])
    plt.title("Zipf law");

.. parsed-literal::

    [ 5  2 18  1  7 26  1  2  1  1  1  1 11  2  1  6  1  2  1  8]

.. image:: skewdata_reduce_3_1.png

Usually, we use a logarithmic scale on both axes, otherwise the plot
mostly shows a white square.

.. code:: ipython3

    plt.loglog(hist[1][:-1], hist[0])
    plt.title("Zipf law, logarithmic scale");

.. image:: skewdata_reduce_5_0.png

Now, we try different values of :math:`s`:

.. code:: ipython3

    from numpy.random import zipf
    from numpy import histogram

    for s in [2, 3, 4]:
        v = zipf(s, 2000)
        h = histogram(v, bins=range(1, max(v) + 1))
        plt.loglog(h[1][:-1], h[0], label="s=%1.0f" % s)
    plt.legend()
    plt.title("Zipf law, logarithmic scale for different s");

.. image:: skewdata_reduce_7_0.png

What about the cost :math:`C` of the last step of the map/reduce job if
the key :math:`w` we reduce on follows a Zipf's law? We can express it
as follows:

.. math:: C \propto K \sum_i \frac{f(i)}{i^s}

The cost :math:`f(i)` is usually polynomial: :math:`f(i) \propto i^t`.
Then :math:`C \propto \sum_i i^{t-s}`. This sum is finite only if
:math:`t - s < -1`, that is :math:`t < s - 1`; otherwise, the most
frequent keys dominate the total cost of the reduce step.
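To make this concrete, here is a small simulation, not part of the
original notebook, which draws the counts :math:`n(w)` from a Zipf law
with :math:`s=2` and measures which share of the total cost
:math:`C=\sum_w f(n(w))` comes from the single biggest key when
:math:`f(i)=i^t`. With :math:`s=2`, the condition :math:`t < s-1`
already fails for :math:`t=1`; the sample size of one million keys is
an arbitrary choice.

.. code:: ipython3

    from numpy.random import zipf

    # one million reduce keys, n(w) drawn from a Zipf law with s=2
    n_w = zipf(2, 1_000_000)

    # cost of the last reduce step for a polynomial cost f(i) = i**t
    for t in [1, 2, 3]:
        costs = n_w.astype(float) ** t
        total = costs.sum()
        print("t=%d  total cost=%1.3g  share of the biggest key=%1.3f"
              % (t, total, costs.max() / total))

The larger :math:`t` is, the more a handful of frequent keys accounts
for almost all of the work, which is exactly the skew the reducer has
to deal with.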
Graphs are also a good example. The web is considered a `scale-free
network `__: it is nearly impossible to break a connected component by
cutting an edge, and the degree distribution follows a `power law `__.
As a result, joining a set of edges against itself is always a
challenge, as a key (= a vertex) is very often shared by a huge number
of edges (= the vertex is connected to many of them). These big
vertices are called "hubs". Depending on what you need to compute on
the graph, you might prune some edges or even remove the hubs (see
`Small-world phenomenon `__).
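As an illustration, and assuming the ``networkx`` library is available
(it is not used elsewhere in this notebook), the following sketch builds
a scale-free graph with the Barabási–Albert model and estimates how much
of a self-join on the edge list would be produced by the biggest hubs.

.. code:: ipython3

    import networkx as nx

    # scale-free graph: 100000 vertices, preferential attachment with m=3
    G = nx.barabasi_albert_graph(100000, 3, seed=0)

    # a self-join on the edge list groups edges by vertex:
    # a vertex of degree d produces d * (d - 1) // 2 pairs
    degrees = sorted((d for _, d in G.degree()), reverse=True)
    pairs = [d * (d - 1) // 2 for d in degrees]
    total = sum(pairs)

    print("number of edges:", G.number_of_edges())
    print("max degree:", degrees[0], "- median degree:", degrees[len(degrees) // 2])
    print("share of the join output due to the 100 biggest hubs: %1.3f"
          % (sum(pairs[:100]) / total))

Pruning edges or removing these hubs before the join, as suggested
above, is usually what keeps such a job tractable.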