Correlation

from the Artful Common Queries page


Probably more bunk has been written about correlation than about any other statistic. We'll keep this short and straight. At its simplest, correlation is a statistical measure of non-random, linear association between pairs of values in a dataset. It's denoted by r, and varies from -1 through +1, where -1 indicates perfect inverse correlation (the regression line goes down left to right), 0 indicates no correlation (there is no regression line; it's just a scatterplot), and +1 indicates perfect direct correlation (the regression line goes up left to right).

For an example we'll use a bit of imaginary data:

drop table if exists t;
create table t (id int, x int, y float);
insert into t values
(1 , 68, 4.1),(2 , 71, 4.6),(3 , 62, 3.8),(4 , 75, 4.4),(5 , 58, 3.2),
(6 , 60, 3.1),(7 , 67, 3.8),(8 , 68, 4.1),(9 , 71, 4.3),(10, 69, 3.7),
(11, 68, 3.5),(12, 67, 3.2),(13, 63, 3.7),(14, 62, 3.3),(15, 60, 3.4),
(16, 63, 4.0),(17, 65, 4.1),(18, 67, 3.8),(19, 63, 3.4),(20, 61, 3.6);

If you like to think about such problems concretely, you can think of id as a subject's id, x as a subject's height, and y as a subject's score on a self-confidence questionnaire, so we would be computing a correlation between height and self-confidence.

There are many correlation formulas. Most commonly used is the Pearson product-moment correlation coefficient, which is valid only for normally distributed data (data which roughly fits a bell curve). A good Wikipedia page offers several formulas for the Pearson coefficient.

Note that when the data skews significantly from a normal distribution, you very likely need a different formula. That's one problem with correlation: it isn't always a valid number to calculate.

Another problem is that computing just the correlation coefficient is never enough. You also need to find the probability that the coefficient you calculated is significantly different from 0. The Wikipedia page on correlation describes several ways of testing the significance of correlations.

You will probably also want the slope of the relationship or regression line, its intercept with the Y axis, and the coefficient of regression, which is the proportion of observed variation due to correlation.

For this example we restrict ourselves to calculating correlation assuming a normal distribution and no missing values.

Calculation of the correlation coefficient needs two passes: a first to calculate basic statistical quantities, then a second to calculate the slope, intercept and correlation coefficient from those basic quantities. Calculating the coefficient of regression requires a third pass:

1. Calculate the required basic statistics.
2. Use the results of #1 to calculate slope, intercept and r.
3. Use the slope and intercept from #2 to calculate the coefficient of regression.
4. Collect and display the results.

If we're computing correlation in a database like MySQL, then, you'd expect Views will be useful. Unfortunately we're immediately bitten by two limitations in the MySQL implementation of Views: subqueries in the FROM clause of a View are not supported, and neither are parameters.

Then do we need to encapsulate correlation in a stored procedure? We could, but we needn't because SQL implementations like MySQL provide the required basic statistical quantities, permitting us to do it all in one four-layer query ...

1. The innermost query does step 1,
2. The query that references the inner query does step 2,
3. The next outer query calculates the regression coefficient by cross-joining the step 2 query result, which is one logical row, with the original table.
4. The outermost query displays the result:

Read the query inside out:

SELECT                                       -- Step 4
  N, Slope, avgY - slope*avgX AS Intercept, 
  Correlation, CoeffOfReg
FROM (
  SELECT                                     -- Step 3
    N, avgX, avgY, slope, intercept, Correlation, 
    FORMAT( 1 - SUM((y - intercept - slope*x)*(y - intercept - slope*x))/
            ((N-1)*varY), 5 ) AS CoeffOfReg
  FROM t AS t2
  JOIN (
    SELECT                                   -- Step 2
      N, avgX, avgY, varY, slope,
      Correlation, avgY - slope*avgX AS intercept
    FROM (
      SELECT
        N, avgX, avgY, varY,
        FORMAT(( N*sumXY - sumX*sumY ) / 
               ( N*sumsqX - sumX*sumX ), 5 )           AS slope,
        FORMAT(( sumXY - n*avgX*avgY ) / 
               ( (N-1) * SQRT(varX) * SQRT(varY)), 5 ) AS Correlation
      FROM (
        SELECT                               -- Step 1
          COUNT(x)    AS N,
          AVG(x)      AS avgX,
          SUM(x)      AS sumX,
          SUM(x*x)    AS sumsqX,
          VAR_SAMP(x) AS varX,
          AVG(y)      AS avgY,
          SUM(y)      AS sumY,
          SUM(y*y)    AS sumsqY,
          VAR_SAMP(y) AS varY,
          SUM(x*y)    AS sumXY
        FROM t
      ) AS sums
    ) AS calc
  ) stats
) combined;
+----+---------+---------------------+-------------+------------+
| N  | Slope   | Intercept           | Correlation | CoeffOfReg |
+----+---------+---------------------+-------------+------------+
| 20 | 0.07066 | -0.8661640047683719 | 0.73064     | 0.53383    |
+----+---------+---------------------+-------------+------------+


Return to the Artful Common Queries page