Windowing: introduction

from the Artful Common Queries page


Added in MySQL 8 and MariaDB 10, SQL window-analytic functions extend and refine the concept of aggregation.

The windowing function OVER() works with all aggregate functions except GROUP_CONCAT()—essentially making them window-aware—and works also with window-analytic functions:

CUME_DIST() Cumulative distribution value
DENSE_RANK() Rank of current row in its window without gaps
FIRST_VALUE() Argument value from first row of window
LAG() Argument value from row lagging current row in window
LAST_VALUE() Argument value from last row of window
LEAD() Argument value from row leading current row in window
NTH_VALUE() Argument value from Nth row of window
NTILE() Bucket number of current row in its window
PERCENT_RANK() Cumulative percent rank of argument value
RANK() Current row rank within its window, with gaps
ROW_NUMBER() Current row number in its window

This is the toy table from http://mysqlserverteam.com/mysql-8-0-2-introducing-window-functions:

CREATE TABLE sales(employee VARCHAR(8), `date` DATE, sale INT);
INSERT INTO sales VALUES 
  ('odin', '2017-03-01', 200),('odin', '2017-04-01', 300),('odin', '2017-05-01', 400),
  ('thor', '2017-03-01', 400),('thor', '2017-04-01', 300),('thor', '2017-05-01', 500);

Aggregate functions suppress display of the rows they aggregate ...

SELECT SUM(sale) AS sum FROM sales;
+------+
| sum  |
+------+
| 2100 |
+------+

... but the OVER() windowing function undoes that row suppression, so individual rows that have been aggregated appear in the result:

SELECT employee, SUM(sale) OVER() AS sum FROM sales;
+----------+------+
| employee | sum  |
+----------+------+
| odin     | 2100 |
| odin     | 2100 |
| odin     | 2100 |
| thor     | 2100 |
| thor     | 2100 |
| thor     | 2100 |
+----------+------+

OVER() accepts two optional arguments, a PARTITION BY clause specifying how to partition row groups into windows, and an ORDER BY clause which orders window rows and makes their aggregate results row-by-row cumulative.

OVER() with only a PARTITION BY clause partitions aggregate results, but each row shows the same aggregate result for its window:
 
SELECT 
  employee, date, sale, 
  SUM(sale) OVER(PARTITION BY employee) AS sum
FROM sales;
+----------+------------+------+------+
| employee | date       | sale | sum  |
+----------+------------+------+------+
| odin     | 2017-03-01 |  200 |  900 |
| odin     | 2017-04-01 |  300 |  900 |
| odin     | 2017-05-01 |  400 |  900 |
| thor     | 2017-03-01 |  400 | 1200 |
| thor     | 2017-04-01 |  300 | 1200 |
| thor     | 2017-05-01 |  500 | 1200 |
+----------+------------+------+------+

As you'd expect, OVER() with only an ORDER BY clause orders by the ORDER BY clause, and accumulates by it:

SELECT employee, date, sale, SUM(sale) OVER(ORDER BY date) AS sum
FROM sales;
+----------+------------+------+------+
| employee | date       | sale | sum  |
+----------+------------+------+------+
| odin     | 2017-03-01 |  200 |  600 | (sum for 2017-03-01)
| thor     | 2017-03-01 |  400 |  600 |
| odin     | 2017-04-01 |  300 | 1200 | (sum for 2017-04-01)
| thor     | 2017-04-01 |  300 | 1200 |
| odin     | 2017-05-01 |  400 | 2100 | (sum for 2017-05-01)
| thor     | 2017-05-01 |  500 | 2100 |
+----------+------------+------+------+

As you might not expect, when there is no PARTITION BY clause, aggregate results repeat within each window. Why?

ORDER BY accepts two kinds of ordering, ROWS which means by physical row, and RANGE which means by value. RANGE is the default. Therefore ORDER BY x with no modifier, as in the above query, means ORDER BY x RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

(How did SQL designers come to take UNBOUNDED to mean "begin with the first item found"? I'd give worlds to know.)

To get ORDER BY x to accumulate by each row value of x instead of by each distinct value found, specify the ROWS option instead:

SELECT 
  employee, sale, date,
  SUM(sale) OVER(ORDER BY date ROWS
                 BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                ) AS cum_sales
FROM sales;
+----------+------+------------+-----------+
| employee | sale | date       | cum_sales |
+----------+------+------------+-----------+
| odin     |  200 | 2017-03-01 |       200 |
| thor     |  400 | 2017-03-01 |       600 |
| odin     |  300 | 2017-04-01 |       900 |
| thor     |  300 | 2017-04-01 |      1200 |
| odin     |  400 | 2017-05-01 |      1600 |
| thor     |  500 | 2017-05-01 |      2100 |
+----------+------+------------+-----------+

We get the full potential of OVER() by giving it both PARTITION BY and ORDER BY arguments:

SELECT 
  employee, date, sale, 
  SUM(sale) OVER(PARTITION BY employee ORDER BY date) AS sum 
FROM sales;
+----------+------------+------+------+
| employee | date       | sale | sum  |
+----------+------------+------+------+
| odin     | 2017-03-01 |  200 |  200 | (odin sum starts to accumulate)
| odin     | 2017-04-01 |  300 |  500 |
| odin     | 2017-05-01 |  400 |  900 |
| thor     | 2017-03-01 |  400 |  400 | (thor sum starts to accumulate)
| thor     | 2017-04-01 |  300 |  700 |
| thor     | 2017-05-01 |  500 | 1200 |
+----------+------------+------+------+

Windowing functions

OVER() works with all aggregate functions except GROUP_CONCAT(), also with the windowing functions implemented since MySQL 8 and MariaDB 10. Mostly, these functions compute row ranking statistics, either across the whole resultset or within windows created by OVER().

Adding windowing functionality to a query involves adding at least two elements:

1. A call to any standard aggregating function except GROUP_CONCAT(), or to one of the windowing functions (listed in the table near the top of this article), and

2. A call to OVER(), specifying the rowset (or partition or window) to which the above aggregating or windowing function should be applied.

See "Calling window functions" below for info on calling these functions, and ample queries using them. The article "Within-group quotas (Top N per group)" illustrates how to use ROW_NUMBER() to rank rows within bands or rowsets.

Given the toy table ...

drop table if exists tbl;
create table tbl(foo char(1), bar int);
insert into tbl values
  ('a',7),('b',5),('c',1),('d',9),('e',6),
  ('f',3),('g',2),('h',8),('i',5),('j',0);

... this query illustrates some of the row-to-row logic of windowing functions, applied to the whole resultset code>win ...

SELECT
  bar as Value,
  ROW_NUMBER() OVER win AS 'Row_Number',
  LAG(bar) OVER win AS 'Lag',
  LAST_VALUE(bar) OVER win AS 'Last_Value',
  CUME_DIST() OVER win AS 'Cume_Dist',
  100*ROUND( PERCENT_RANK() OVER win, 1 ) AS 'Percent_Rank',
  DENSE_RANK() OVER win AS 'Dense_Rank'
FROM tbl
WINDOW win AS (ORDER BY bar);
+-------+------------+------+------------+-----------+--------------+------------+
| Value | Row_Number | Lag  | Last_Value | Cume_Dist | Percent_Rank | Dense_Rank |
+-------+------------+------+------------+-----------+--------------+------------+
|     0 |          1 | NULL |          0 |       0.1 |          0.0 |          2 |
|     1 |          2 |    0 |          1 |       0.2 |         10.0 |          3 |
|     2 |          3 |    1 |          2 |       0.3 |         20.0 |          4 |
|     3 |          4 |    2 |          3 |       0.4 |         30.0 |          5 |
|     5 |          5 |    3 |          5 |       0.6 |         40.0 |          6 |
|     5 |          6 |    5 |          5 |       0.6 |         40.0 |          6 |
|     6 |          7 |    5 |          6 |       0.7 |         70.0 |          7 |
|     7 |          8 |    6 |          7 |       0.8 |         80.0 |          8 |
|     8 |          9 |    7 |          8 |       0.9 |         90.0 |          9 |
|     9 |         10 |    8 |          9 |         1 |        100.0 |         10 |
+-------+------------+------+------------+-----------+--------------+------------+

Here's another banding example: the toy table c has multiple values of pid per value of id:

create table c ( id smallint primary key auto_increment, pid smallint );
insert into c values(1,1),(2,1),(3,2),(4,2),(5,2),(6,3);

A combination of aggregation with windowing functions retrieves the count, the first id value per pid, and the last id value per pid, without GROUP BY:

select distinct 
  pid, 
  count(*)        over (partition by pid) as N, 
  first_value(id) over (partition by pid) as 'first', 
  last_value(id)  over (partition by pid) as 'last' 
from c 
order by pid;
+------+---+-------+------+
| pid  | N | first | last |
+------+---+-------+------+
|    1 | 2 |     1 |    2 |
|    2 | 3 |     3 |    5 |
|    3 | 1 |     6 |    6 |
+------+---+-------+------+

Calling analytic functions

Many of these function take no arguments. The few that do mostly accept an expr argument, a column expression that evaluates to a scalar; other arguments are as shown, brackets indicate the argument is optional.

CUME_DIST(): Returns cumulative distribution between 0 and 1

DENSE_RANK(): Returns 1 + count of distinct ranks before current row

FIRST_VALUE(): Returns argument value from first row of window

LAG(expr[, offset[, nullexpr]]): Returns value of expr in row lagging current window row by offset rows; show nullexpr for nulls

LAST_VALUE(): Returns argument value from last row of window

LEAD(expr[, offset[, nullexpr]]): Returns value of expr in row leading current window row by offset rows; show nullexpr for nulls

NTH_VALUE(expr, n): Returns value of expr in nth window row

NTILE(n): Returns bucket count of current window row if there are n buckets

PERCENT_RANK(): Returns cumulative percent argument rank value

RANK(): Returns current row rank within its window, with gaps

ROW_NUMBER(): Returns current row number in its window

This query on the sales table illustrates some basics of their use:

SELECT 
  employee as name, date, sale, 
  ROW_NUMBER()              OVER(PARTITION BY employee ORDER BY date) AS 'Row', 
  NTILE(2)                  OVER(PARTITION BY employee ORDER BY date) AS 'Tile', 
  FIRST_VALUE(sale)         OVER(PARTITION BY employee ORDER BY date) AS '1stVal', 
  NTH_VALUE(sale,1)         OVER(PARTITION BY employee ORDER BY date) AS '2ndVal', 
  LAST_VALUE(sale)          OVER(PARTITION BY employee ORDER BY date) AS 'LastVal', 
  LAG(sale,1,'None')        OVER(PARTITION BY employee ORDER BY date) AS 'Lag', 
  LEAD(sale,1,'None')       OVER(PARTITION BY employee ORDER BY date) AS 'Lead', 
  ROUND( CUME_DIST()        OVER(PARTITION BY employee ORDER BY date), 1 ) AS 'CumDist', 
  100*ROUND( PERCENT_RANK() OVER(PARTITION BY employee ORDER BY date), 1 ) AS 'PctRnk', 
  DENSE_RANK()              OVER(PARTITION BY employee ORDER BY date) AS 'DRnk' 
FROM sales;
+------+------------+------+-----+------+--------+--------+---------+------+------+---------+--------+------+
| name | date       | sale | Row | Tile | 1stVal | 2ndVal | LastVal | Lag  | Lead | CumDist | PctRnk | DRnk |
+------+------------+------+-----+------+--------+--------+---------+------+------+---------+--------+------+
| odin | 2017-03-01 |  200 |   1 |    1 |    200 |    200 |     200 | None | 300  |     0.3 |    0.0 |    1 |
| odin | 2017-04-01 |  300 |   2 |    1 |    200 |    200 |     300 | 200  | 400  |     0.7 |   50.0 |    2 |
| odin | 2017-05-01 |  400 |   3 |    2 |    200 |    200 |     400 | 300  | None |     1.0 |  100.0 |    3 |
| thor | 2017-03-01 |  400 |   1 |    1 |    400 |    400 |     400 | None | 300  |     0.3 |    0.0 |    1 |
| thor | 2017-04-01 |  300 |   2 |    1 |    400 |    400 |     300 | 400  | 500  |     0.7 |   50.0 |    2 |
| thor | 2017-05-01 |  500 |   3 |    2 |    400 |    400 |     500 | 300  | None |     1.0 |  100.0 |    3 |
+------+------------+------+-----+------+--------+--------+---------+------+------+---------+--------+------+


Return to the Artful Common Queries page