Running sums with & without CTEs

from the Artful Common Queries page


Straight SQL isn't designed for row-to-row calculations---they can be done, but they're seriously inefficient, e.g., given a chequebook-like table tbl( id int primary key,d date,value decimal(10,2), key(date) ), this query will return a running balance ...

select  
  d, value, 
  (select sum(value) as balance
   from tbl as running
   where running.d <= tbl.d 
  ) as balance 
from tbl
order by d;

... but at the cost of recomputing the running sum from the top for every row. On a big table, even with a date index, that'll be slow.

Originally cursors were designed to remedy the awkwardness and slowness of SQL with row-to-row computations, but cursors are much slower than straight SQL.

A user variable can sidestep these inefficiencies by maintaining the running total in memory. Initialise the variable, then adjust its value as desired in the appropriate SELECT expression:

SET @balancel=0;
SELECT d, value, @balance:=@balance+value AS RunningSum
FROM tbl
ORDER BY d;

This example tracks the running balance and how much money from the most recent deposit remains ...

DROP TABLE IF EXISTS chequebook;
CREATE TABLE chequebook (
  entry_date timestamp default now() PRIMARY KEY,
  entry_item varchar(48) NOT NULL DEFAULT '',
  entry_amount decimal(10,2) NOT NULL DEFAULT 0.00
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO chequebook (entry_date,entry_item,entry_amount) VALUES 
('2010-01-02 12:34:00','Deposit A',215.56),
('2010-01-02 21:44:00','Withdrawal A' ,-23.34),
('2010-01-03 10:44:00','Withdrawal B',-150.15),
('2010-01-03 15:44:00','Deposit B',154.67),
('2010-01-04 18:44:00','Withdrawal C',-65.09),
('2010-01-05 08:44:00','Withdrawal D',-74.23),
('2010-01-06 14:44:00','Deposit C',325.12),
('2010-01-06 20:44:00','Withdrawal E',-80.12),
('2010-01-07 04:44:00','Withdrawal F',-110.34),
('2010-01-07 16:44:00','Withdrawal G',-150.25),
('2010-01-08 16:44:00','Withdrawal H',-23.90),
('2010-01-08 21:44:00','Withdrawal I',-75.66),
('2010-01-08 22:44:00','Deposit C',275.78),
('2010-01-09 11:44:00','Withdrawal K',-85.99),
('2010-01-09 21:44:00','Withdrawal J',-100.00);
 
set @depos=0, @total=0;
select 
  entry_date, 
  entry_item, 
  entry_amount, 
  if( entry_amount>0, @depos:=entry_amount, @depos:=@depos+entry_amount ) as depos_bal,
  @total:=@total+entry_amount as net_bal
from chequebook
order by entry_date;
+---------------------+--------------+--------------+-----------+---------+
| entry_date          | entry_item   | entry_amount | depos_bal | net_bal |
+---------------------+--------------+--------------+-----------+---------+
| 2010-01-02 12:34:00 | Deposit A    |       215.56 |    215.56 |  215.56 |
| 2010-01-02 21:44:00 | Withdrawal A |       -23.34 |    192.22 |  192.22 |
| 2010-01-03 10:44:00 | Withdrawal B |      -150.15 |     42.07 |   42.07 |
| 2010-01-03 15:44:00 | Deposit B    |       154.67 |    154.67 |  196.74 |
| 2010-01-04 18:44:00 | Withdrawal C |       -65.09 |     89.58 |  131.65 |
| 2010-01-05 08:44:00 | Withdrawal D |       -74.23 |     15.35 |   57.42 |
| 2010-01-06 14:44:00 | Deposit C    |       325.12 |    325.12 |  382.54 |
| 2010-01-06 20:44:00 | Withdrawal E |       -80.12 |    245.00 |  302.42 |
| 2010-01-07 04:44:00 | Withdrawal F |      -110.34 |    134.66 |  192.08 |
| 2010-01-07 16:44:00 | Withdrawal G |      -150.25 |    -15.59 |   41.83 |
| 2010-01-08 16:44:00 | Withdrawal H |       -23.90 |    -39.49 |   17.93 |
| 2010-01-08 21:44:00 | Withdrawal I |       -75.66 |   -115.15 |  -57.73 |
| 2010-01-08 22:44:00 | Deposit C    |       275.78 |    275.78 |  218.05 |
| 2010-01-09 11:44:00 | Withdrawal K |       -85.99 |    189.79 |  132.06 |
| 2010-01-09 21:44:00 | Withdrawal J |      -100.00 |     89.79 |   32.06 |
+---------------------+--------------+--------------+-----------+---------+

If your platform does not permit multiple queries per connection, and if you can tolerate the O(N2) inefficiency, a self-join does the same job as an accumulating user variable ...

SELECT c.id, c.value, d.RunningSum
FROM tbl c
JOIN (
  SELECT a.id, SUM(b.value) AS RunningSum
  FROM tbl a
  LEFT JOIN tbl b ON b.id <= a.id
  GROUP BY a.id
) d USING (id);

Accumulating a sum and resetting it (ie subgrouping it) on another column value is a little harder ...

drop table if exists t;
create table t( id smallint primary key, a char(1), v int );
insert into t values(1,'a',10),(2,'b',20),(3,'b',30),(4,'c',40);

select x.a, x.v, sum(x.v) as cum
from t x 
join t y on y.a = x.a and y.id <= x.id 
group by x.id 
order by x.a, x.id;
+------+------+------+
| a    | v    | cum  |
+------+------+------+
| a    |   10 |   10 |
| b    |   20 |   20 |
| b    |   30 |   60 |
| c    |   40 |   40 |
+------+------+------+

Since MySQL 8.0.12 we can do row-to-row calculations with using any of the above algorithms; here is a running sum CTE using the Row_Number() function to generate a sequential row id and the join algorithm for accumulating a sum...

with recursive                         -- so `x` can self-reference
  y as (                               -- sequential row id from row_number()
    select d, value, row_number() over(order by d) as rn
    from tbl
  ), 
  x as (                               -- cumulative sum
    select d, rn, value, value as balance  
    from y
    where rn=1                         -- first 'seed' row
    union all
    select y.d, y.rn, y.value, x.balance + y.value as balance
    from x 
    join y on y.rn=x.rn+1              -- `x` walks itself
  )
select d, value, balance               -- display result
from x
order by x.rn;


Return to the Artful Common Queries page