## Sum for time periods

### from the Artful Common Queries page

 A table tracks attendance at some location: ```drop table if exists t; create table t(interval_id int,start datetime,end datetime, att int); insert into t values (1,'2007-01-01 08:00:00','2007-01-01 12:00:00',5 ), (2,'2007-01-01 13:00:00','2007-01-01 17:00:00',10), (3,'2007-01-01 10:00:00','2007-01-01 15:00:00',15), (4,'2007-01-01 14:00:00','2007-03-07 19:00:00',20); select * from t; +-------------+---------------------+---------------------+------+ | interval_id | start | end | att | +-------------+---------------------+---------------------+------+ | 1 | 2007-01-01 08:00:00 | 2007-01-01 12:00:00 | 5 | | 2 | 2007-01-01 13:00:00 | 2007-01-01 17:00:00 | 10 | | 3 | 2007-01-01 10:00:00 | 2007-01-01 15:00:00 | 15 | | 4 | 2007-01-01 14:00:00 | 2007-03-07 19:00:00 | 20 | +-------------+---------------------+---------------------+------+ ``` In this table, `att` is a delta: it tracks entrances. Actual attendance is `SUM(att)` at any given moment. For example, if the attendance sum is `x` at a given moment, then after we add a row with `att=y` and start/end datetimes embracing that moment, attendance will be `x+y`. So to retrieve total attendance at 01330h on 1 Jan 2007, we write: ```SELECT SUM(att) FROM t WHERE t.start <= '2007-01-01 13:30:00' AND t.end >= '2007-01-01 13:30:00'; +----------+ | SUM(att) | +----------+ | 25 | +----------+ ``` Then how would we extract maximum attendance during a given period, for example, maximum attendance between 1300h and 1700h? SQL isn't efficient with time. Some SQL dialects offer time series enhancements to the language; MySQL does not. And, querying time series data for aggregate statistics gets complicated very quickly. It gets a bit simpler with a calendar table that has a row for every possible datetime value. For our example, assume a granularity of one hour and a query period of one day. Naturally a real system would require a range of dates and perhaps a finer time granularity: ```create table cal(id int,dt datetime); insert into cal values(1,'2007-1-1 01:00:00'); insert into cal values(2,'2007-1-1 02:00:00'); insert into cal values(3,'2007-1-1 03:00:00'); insert into cal values(4,'2007-1-1 04:00:00'); insert into cal values(5,'2007-1-1 05:00:00'); insert into cal values(6,'2007-1-1 06:00:00'); insert into cal values(7,'2007-1-1 07:00:00'); insert into cal values(8,'2007-1-1 08:00:00'); insert into cal values(9,'2007-1-1 09:00:00'); insert into cal values(10,'2007-1-1 10:00:00'); insert into cal values(11,'2007-1-1 11:00:00'); insert into cal values(12,'2007-1-1 12:00:00'); insert into cal values(13,'2007-1-1 13:00:00'); insert into cal values(14,'2007-1-1 14:00:00'); insert into cal values(15,'2007-1-1 15:00:00'); insert into cal values(16,'2007-1-1 16:00:00'); insert into cal values(17,'2007-1-1 17:00:00'); insert into cal values(18,'2007-1-1 18:00:00'); insert into cal values(19,'2007-1-1 19:00:00'); insert into cal values(20,'2007-1-1 20:00:00'); insert into cal values(21,'2007-1-1 21:00:00'); insert into cal values(22,'2007-1-1 22:00:00'); insert into cal values(23,'2007-1-1 23:00:00'); insert into cal values(24,'2007-1-1 24:00:00'); ``` To accumulate the maximum attendance sum, collect target values for defined periods in an inner query, and sum them from the outer query: ```SELECT SUM( att ) FROM ( SELECT t.start AS PeriodStart, t.end AS PeriodEnd, MIN(cal.dt) + INTERVAL 1 HOUR AS CountBegin, MAX(cal.dt) AS CountEnd, t.att FROM t JOIN cal ON cal.dt >= t.start AND cal.dt < t.end GROUP BY PeriodStart, PeriodEnd HAVING CountBegin < '2007-01-01 17:00:00' AND CountEnd > '2007-01-01 11:00:00' ) AS periods; +------------+ | SUM( att ) | +------------+ | 45 | +------------+ ``` If the data is more complicated, eg if we also need to track exits, the period logic needs refinement but the principle remains the same. Here is a variation on the period sums problem posted in a MySQL forum. Multiple daily measurements are taken at variable times, and must be summed for the 24-hour period starting at 4pm each day. ```drop table if exists t; create table t( id smallint, Name char(16), timestamp timestamp(3), Value1 decimal(20,8), Value2 decimal(20,8) ); insert into t values (1, 'Marco', '2009-05-04 16:01:01.582', 0.0000000, -0.0000400 ), (2, 'James' , '2009-05-04 23:01:01.582', -0.0001400, 0.0000000 ), (3, 'Louise', '2009-05-05 00:00:00.833', -0.0001100, -0.0001400 ), (4, 'Martin', '2009-05-05 01:01:01.582', -0.0000400, -0.0001100 ), (5, 'Jamie', '2009-05-05 02:01:01.582', 0.0001100, -0.0000400 ), (6, 'Cecilie', '2009-05-05 03:01:01.582', 0.0000500, 0.0001100 ), (7, 'Ben', '2009-05-05 15:01:01.582', 0.0000000, 0.0000500 ), (8, 'Amanda', '2009-05-05 16:01:01.582', -0.0000800, 0.0000000 ), (9, 'Ida', '2009-05-05 23:01:01.582', 0.0000000, -0.0000800 ); ``` How to identify values as belonging within the desired intervals? A summing period begins at 4pm on a given date, and ends at 15:59:59.999 on the next day, so for a given day, the time boundaries are ... start: `concat( date(timestamp), ' 16:00:00:000' )` stop: `concat( date(timestamp)+interval 1 day, ' 15:59:59:999' )` For optimal performance, write a simple query that creates such a "calendar" table so you can efficiently join from it to your data table, eg ... ```drop table if exists cal; create table cal select distinct concat( date(timestamp), ' 16:00:00:000' ) as tbegin, concat(date(timestamp)+interval 1 day, ' 15:59:59:999' ) as tend from t; ``` Here we derive the `cal` table dynamically. Once we have a SQL representation of the summing periods, ie as the columns cal.tbegin and cal.tend, all we need do is join from the calendar to the data table `on timestamp between cal.tbegin and cal.tend`, and `Group By tbegin, tend`: ```select cal.tbegin, cal.tend, sum(vals.value1), sum(vals.value2) from ( select distinct concat( date(timestamp), ' 16:00:00:000' ) as tbegin, concat(date(timestamp)+interval 1 day, ' 15:59:59:999' ) as tend from t ) as cal join ( select timestamp,value1,value2 from t ) as vals on timestamp between cal.tbegin and cal.tend group by cal.tbegin, cal.tend; ``` What if the summing periods may be multiple days? That complicates construction of the "calendar". Add a row for a multi-date summing period ... ```insert into t values (10, 'Nino', '2009-07-04 17:01:01.582', 0.0005000, -0.0006000 ); ``` The rows that mark summing boundaries are rows with ... (i) the smallest date > the current row's date, and (ii) the smallest time value >= 4pm ... In SQL that's ... ```select date(a.timestamp) as startdate, min(b.timestamp) as nextdate from t as a left join t as b on date(b.timestamp) > date(a.timestamp) and time(b.timestamp) >= '16:00:00' group by startdate; +------------+-------------------------+ | startdate | nextdate | +------------+-------------------------+ | 2009-05-04 | 2009-05-05 16:01:01.582 | | 2009-05-05 | 2009-07-04 17:01:01.582 | | 2009-07-04 | NULL | +------------+-------------------------+ ``` Then the summing boundaries are those dates except for the last nextdate, which is necessarily null, and which we need to replace with the date of the following day. So the summing boundaries are ... ```select concat( tmp.startdate, ' 16:00:00.000' ) as tbegin, concat( ifnull(tmp.nextdate,date(tmp.startdate)+interval 1 day), ' 15:59:59.999' ) as tend from ( select date(a.timestamp) as startdate, date( min(b.timestamp) ) as nextdate from t as a left join t as b on date(b.timestamp) > date(a.timestamp) and time(b.timestamp) >= '16:00:00' group by startdate ) as tmp; ``` ... so create a temp calendar table from that query ... ```drop table if exists cal; create table cal select concat( tmp.startdate, ' 16:00:00.000' ) as tbegin, concat( ifnull(tmp.nextdate,date(tmp.startdate)+interval 1 day), ' 15:59:59.999' ) as tend from ( select date(a.timestamp) as startdate, date( min(b.timestamp) ) as nextdate from t as a left join t as b on date(b.timestamp) > date(a.timestamp) and time(b.timestamp) >= '16:00:00' group by startdate ) as tmp; ``` ... and the query becomes .. ```select cal.tbegin, cal.tend, sum(vals.value1) as val1sum, sum(vals.value2) as val2sum from cal join ( select timestamp,value1,value2 from t ) as vals on timestamp between cal.tbegin and cal.tend group by cal.tbegin, cal.tend; +-------------------------+-------------------------+------------------+------------------+ | tbegin | tend | sum(vals.value1) | sum(vals.value2) | +-------------------------+-------------------------+------------------+------------------+ | 2009-05-04 16:00:00.000 | 2009-05-05 15:59:59.999 | -0.00013000 | -0.00017000 | | 2009-05-05 16:00:00.000 | 2009-07-04 15:59:59.999 | -0.00008000 | -0.00008000 | | 2009-07-04 16:00:00.000 | 2009-07-05 15:59:59.999 | 0.00050000 | -0.00060000 | +-------------------------+-------------------------+------------------+------------------+ ```Last updated 16 Aug 2019