Correlation

from the Artful Common Queries page


Probably more bunk has been written about correlation than about any other statistic. We'll keep this short and straight. At its simplest, correlation is a statistical measure of non-random, linear association between pairs of values in a dataset. It's denoted by r, and varies from -1 through +1, where -1 indicates perfect inverse correlation (the regression line goes down left to right), 0 indicates no correlation (there is no regression line; it's just a scatterplot), and +1 indicates perfect direct correlation (the regression line goes up left to right).

For an example we'll use a bit of imaginary data:

drop table if exists xydata;
create table xydata (id int, x int, y float);
insert into xydata values
(1 , 68, 4.1),(2 , 71, 4.6),(3 , 62, 3.8),(4 , 75, 4.4),(5 , 58, 3.2),
(6 , 60, 3.1),(7 , 67, 3.8),(8 , 68, 4.1),(9 , 71, 4.3),(10, 69, 3.7),
(11, 68, 3.5),(12, 67, 3.2),(13, 63, 3.7),(14, 62, 3.3),(15, 60, 3.4),
(16, 63, 4.0),(17, 65, 4.1),(18, 67, 3.8),(19, 63, 3.4),(20, 61, 3.6);

If you like to think about such problems concretely, you can think of id as a subject's id, x as a subject's height, and y as a subject's score on a self-confidence questionnaire, so we would be computing a correlation between height and self-confidence.

There are many correlation formulas. Most commonly used is the Pearson product-moment correlation coefficient, which is valid only for normally distributed data (data that roughly fits a bell curve). When the data skews significantly from a normal distribution, you very likely need a different correlation method.

You will also want the slope of the relationship or regression line, its intercept with the Y axis, and the coefficient of regression, which is the proportion of observed variation due to correlation.

And, you need to calculate the probability that the coefficient you calculated is significantly different from zero. Here we use a simple standard formula to calculate a t statistic for the correlation result. The Wikipedia page on correlation describes several ways of testing the significance of correlations.

For this example we restrict ourselves to calculating correlation assuming a normal distribution and no missing values.

Calculating the correlation coefficient needs two passes: a first pass to calculate basic statistical quantities, then a second pass to calculate the slope, intercept and correlation coefficient from those basic quantities. Calculating the coefficient of regression and statistical significance needs a third pass. So this will be a 4-step:

Step 1: Calculate the required basic statistics.
Step 2: Use the results of Step 1 to calculate slope, intercept and r.
Step 3: Run the slope and intercept from Step 2 against xydata.y values to calculate the coefficient of regression.
Step 4: Collect the result, calculate t and degrees of freedom (df).

If we're computing correlation in a database like MySQL, then, you'd expect Views will be useful. Unfortunately we're immediately bitten by two limitations in the MySQL implementation of Views: subqueries in the FROM clause of a View are not supported, and neither are parameters.

Then do we need to encapsulate correlation in a stored procedure? We could, but we needn't because SQL implementations like MySQL provide the required basic statistical quantities, permitting us to do it all in one four-layer query ...

Step 1 is the innermost query, returning one row of summary statistics,

Step 2 collects and formats the Step 1 result row,

Step 3 calculates the regression coefficient by cross-joining the Step 2 query result with the original table.

Step 4, the outermost query, applies a simple formula for the t statiatic with N-2 degrees of freedom.

Read the query inside out:

SELECT                                       -- Step 4
  N, Slope, avgY - slope*avgX AS Intercept, r AS Correlation, 
  CoeffOfReg, Round( r * SqRt( (N-2)/(1-r*r) ), 2 ) AS t, N-2 AS df
FROM (
  SELECT                                     -- Step 3
    N, avgX, avgY, slope, intercept, r, 
    FORMAT( 1 - SUM((y - intercept - slope*x)*(y - intercept - slope*x))/
            ((N-1)*varY), 5 ) AS CoeffOfReg 
  FROM xydata
  CROSS JOIN (
    SELECT                                   -- Step 2
      N, avgX, avgY, varY, slope,
      r, avgY - slope*avgX AS intercept
    FROM (
      SELECT
        N, avgX, avgY, varY,
        FORMAT(( N*sumXY - sumX*sumY ) / 
               ( N*sumsqX - sumX*sumX ), 5 )           AS slope,
        FORMAT(( sumXY - n*avgX*avgY ) / 
               ( (N-1) * SQRT(varX) * SQRT(varY)), 5 ) AS r
      FROM (
        SELECT                               -- Step 1
          COUNT(x)    AS N,
          AVG(x)      AS avgX,
          SUM(x)      AS sumX,
          SUM(x*x)    AS sumsqX,
          VAR_SAMP(x) AS varX,
          AVG(y)      AS avgY,
          SUM(y)      AS sumY,
          SUM(y*y)    AS sumsqY,
          VAR_SAMP(y) AS varY,
          SUM(x*y)    AS sumXY
        FROM xydata
      ) AS sums
    ) AS calc
  ) AS stats
) AS result;
+------+---------+---------------------+-------------+------------+------+------+
| N    | Slope   | Intercept           | Correlation | CoeffOfReg | t    | df   |
+------+---------+---------------------+-------------+------------+------+------+
|   20 | 0.07066 | -0.8661640047683719 | 0.73064     | 0.53383    | 4.54 |   18 |
+------+---------+---------------------+-------------+------------+------+------+

You can look up the significance of the t value at a pre-decided level of significance for the number of degrees of freedom in your sample in a standard statistical table of t values, for example this one. As you can see there, the higher the degrees of freedom, the lower the t value needed for a given level of significance.

Return to the Artful Common Queries page