Friday, 2 August 2013

Correlation, Regression and how to Destroy Information.




(The above image is from an article by Felix Salomon - 23/2/2009).
When a continuous domain is transferred onto another continuous domain, the process is called transformation

When a discrete domain is transferred onto another discrete domain, the process is called mapping

But when a discrete domain is transferred onto a continuous domain, what is the process called? Not clear, but in such a process information is destroyed. Regression is an example. Discrete (often expensive to get) data is used to build a function that fits the data, after which the data is gently removed and life continues on the smooth and differentiable function (or surface) to the delight of mathematicians. Typically,  democratic-flavoured approaches such as Least Squares are adopted to perpetrate the crime.

The reason we call Least Squares (and other related methods) "democratic" (in democracy everyone gets one vote, even assassins who get re-inserted into society, just as respectful hard-working and law-observing citizens) is that every point contributes to the construction of the mentioned best-fit function in equal measure. In other words, data points sitting in a cluster are treated equally with dispersed points. All that matters is the vertical distance from the sought best-fit function.

Finally, we have the icing on the cake: correlation. Look at the figure below, depicting two sets of points lying along a straight line.



The regression model is the same in each case. The correlations too! But how can that be? These two cases correspond to two totally different situations. The physics needed to distribute points evenly is not the same which makes them cluster into two groups. And yet in both cases stats yields a 100% correlation coefficient without distinguishing between two evidently different situations. What's more, in the void between the two clusters one cannot use the regression model just like that.  Assuming continuity a-priori can come at a heavy price.

Clearly this is a very simple example. The point, however, is that not many individuals out there are curious enough to look a bit deeper into data (yes, even visually!) and ask basic questions when using statistics or other methods.

By the way, "regression" is defined (Merriam Webster Dictionary) as "trend or shift to a lower or less perfect state". Indeed, when you kill information - replacing the original data with a best-fit line - this is all you can expect.