Within-group quotas (Top N per group)

from the Artful Common Queries page


A table has multiple rows per key value, and you need to retrieve, say, the first or earliest two rows per key. For example:

DROP TABLE IF EXISTS test;
CREATE TABLE test( id INT, entrydate DATE );
INSERT INTO test VALUES
( 1, '2007-5-01' ),( 1, '2007-5-02' ),( 1, '2007-5-03' ),( 1, '2007-5-04' ),
( 1, '2007-5-05' ),( 1, '2007-5-06' ),( 2, '2007-6-01' ),( 2, '2007-6-02' ),
( 2, '2007-6-03' ),( 2, '2007-6-04' ),( 3, '2007-7-01' ),( 3, '2007-7-02' ),
( 3, '2007-7-03' );

In MySQL since version 8.0.2 and MariaDB since version 10.2.2, windowing with Row_Number() allows this straightforward solution:

SELECT id, entrydate, `rank`
FROM ( 
  SELECT 
    id, entrydate, 
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY id,entrydate) AS `rank`
  FROM test 
  ORDER BY id 
) AS tmp 
WHERE tmp.`rank` <= 2 
ORDER BY id, entrydate; 
+------+------------+------+
| id   | entrydate  | rank |
+------+------------+------+
|    1 | 2007-05-01 |    1 |
|    1 | 2007-05-02 |    2 |
|    2 | 2007-06-01 |    1 |
|    2 | 2007-06-02 |    2 |
|    3 | 2007-07-01 |    1 |
|    3 | 2007-07-02 |    2 |
+------+------------+------+

Note that since there's now a Rank() function, rank is a reserved word needing backticks round it.

What if you can't yet get to MySQL 8 or MariaDB 10? One approach is to rank rows with user variables and pick off the top two for each key in the WHERE clause:

SELECT id, entrydate, `rank`
FROM (
  SELECT
    id, entrydate,
    IF( @prev <> id, @rownum := 1, @rownum := @rownum+1 ) AS `rank`,
    @prev := id
  FROM test.test
  JOIN (SELECT @rownum := NULL, @prev := 0) AS r
  ORDER BY id, entrydate
) AS tmp
WHERE tmp.`rank` <= 2
ORDER BY id, entrydate;

This is pretty much the same query pattern as the user variable Row_Number() emulation method described in the "Row_Number()" entry. The join in the subquery is just a device for resetting the variables after reading a row.

How do Row_Number() and the user variable method compare in performance? They both return correct results for 1,000 rows in less than a hundredth of a second on a modest machine; Row_Number() is slightly faster but you'll need to be reading millions of rows to notice the difference.

If the groups are fairly small, another feasible approach is to self-join and count. With appropriate ordering, the first two rows per ID are the rows which, for a given ID, have two or fewer rows with earlier dates. If we use an inequality join with the COUNT(*) function to find the earlier rows per ID ...

SELECT t1.id, t1.entrydate, COUNT(*) AS earlier
FROM test AS t1
JOIN test AS t2 ON t1.id=t2.id AND t1.entrydate >= t2.entrydate
GROUP BY t1.id, t1.entrydate
+------+------------+---------+
| id   | entrydate  | earlier |
+------+------------+---------+
|    1 | 2007-05-01 |       1 |
|    1 | 2007-05-02 |       2 |
|    1 | 2007-05-03 |       3 |
|    1 | 2007-05-04 |       4 |
|    1 | 2007-05-05 |       5 |
|    1 | 2007-05-06 |       6 |
|    2 | 2007-06-01 |       1 |
|    2 | 2007-06-02 |       2 |
|    2 | 2007-06-03 |       3 |
|    2 | 2007-06-04 |       4 |
|    3 | 2007-07-01 |       1 |
|    3 | 2007-07-02 |       2 |
|    3 | 2007-07-03 |       3 |
+------+------------+---------+

... then we get our result by removing rows where the 'earlier' count exceeds 2:

SELECT t1.id, t1.entrydate, count(*) AS earlier
FROM test AS t1
JOIN test AS t2 ON t1.id=t2.id AND t1.entrydate >= t2.entrydate
GROUP BY t1.id, t1.entrydate
HAVING earlier <= 2;
+------+------------+---------+
| id   | entrydate  | earlier |
+------+------------+---------+
|    1 | 2007-05-01 |       1 |
|    1 | 2007-05-02 |       2 |
|    2 | 2007-06-01 |       1 |
|    2 | 2007-06-02 |       2 |
|    3 | 2007-07-01 |       1 |
|    3 | 2007-07-02 |       2 |
+------+------------+---------+

This is about as efficient as the first method with a small table, but it compares every within-group row to every other within-group row. As the size N of a group increases, execution time increases by N2. If the query takes one minute for groups of 1,000, it will take 16 minutes for groups of 4,000, and more than four hours for groups for 16,000. The solution does not scale.

What to do? Forget GROUP BY! Manually assemble the desired query results in a temporary table from simple indexed queries, in this case, two rows per ID:

DROP TEMPORARY TABLE IF EXISTS earliers;
CREATE TEMPORARY TABLE earliers( id INT, entrydate DATE);
INSERT INTO earliers 
  SELECT id,entrydate FROM test WHERE id=1 ORDER BY entrydate LIMIT 2;
INSERT INTO earliers 
  SELECT id,entrydate FROM test WHERE id=2 ORDER BY entrydate LIMIT 2;
INSERT INTO earliers 
  SELECT id,entrydate FROM test WHERE id=3 ORDER BY entrydate LIMIT 2;

You need one INSERT statement per grouping value. To print the result, just query the earliers table:

SELECT * FROM earliers
ORDER BY id, entrydate;
+------+------------+
| id   | entrydate  |
+------+------------+
|    1 | 2007-05-01 |
|    1 | 2007-05-02 |
|    2 | 2007-06-01 |
|    2 | 2007-06-02 |
|    3 | 2007-07-01 |
|    3 | 2007-07-02 |
+------+------------+
DROP TEMPORARY TABLE earliers;

Most useful reports run again and again. If that's the case for yours, automate it in a stored procedure: using a cursor and a prepared statement, auto-generate an INSERT statement for every grouping value, and return the result:

DROP PROCEDURE IF EXISTS listearliers;
DELIMITER |
CREATE PROCEDURE listearliers()
BEGIN
  DECLARE curdone, vid INT DEFAULT 0;
  DECLARE idcur CURSOR FOR SELECT DISTINCT id FROM test;
  DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET curdone = 1;
  DROP TEMPORARY TABLE IF EXISTS earliers;
  CREATE TEMPORARY TABLE earliers( id INT, entrydate DATE);
  SET @sql = 'INSERT INTO earliers SELECT id,entrydate FROM test WHERE id=? ORDER BY  entrydate LIMIT 2';
  OPEN idcur;
  REPEAT
    FETCH idcur INTO vid;
    IF NOT curdone THEN
      BEGIN
        SET @vid = vid;
        PREPARE stmt FROM @sql;
        EXECUTE stmt USING @vid;
        DROP PREPARE stmt;
      END;
    END IF;
  UNTIL curdone END REPEAT;
  CLOSE idcur;
  SELECT * FROM earliers ORDER BY id,entrydate;
  DROP TEMPORARY TABLE earliers;
END;
|
DELIMITER ;
CALL listearliers();


Return to the Artful Common Queries page